DIL Workshop 👩‍💻👨‍💻

By Yukun & Mike
Part I - 05/08/2019

Jupyter Notebook Fundamentals 📒

Jupyter

  • An editable document that combines executable code and notes
  • Each block is a CELL
  • Each CELL can be a chunk of code (the default) or text (Markdown)
  • Many commands operate on CELLS: insert, delete, run, merge, move up, move down, etc.

How does it work?

The notebook server, not the kernel, is responsible for saving and loading notebooks, so you can edit notebooks even if you don’t have the kernel for that language—you just won’t be able to run code. The kernel doesn’t know anything about the notebook document: it just gets sent cells of code to execute when the user runs them.


TL;DR
We work in a browser, which talks to the notebook server; the server sends the code of each cell to a kernel for execution. The kernel determines the language you are running, e.g. Python, R, or Scala.
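One way to see why the server can save and load a notebook without any kernel: an .ipynb file is a plain JSON document. The sketch below builds a minimal notebook by hand and round-trips it with the standard json module. The cell contents are invented for illustration, and only the most basic nbformat 4 fields are shown.

```python
import json

# A hand-written, minimal notebook in the nbformat 4 layout.
# The cell contents are made up for this illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# DIL Workshop"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,  # displayed as "In [ ]" until the cell is run
            "outputs": [],
            "source": ["print('I love DIL')"],
        },
    ],
}

# Saving and loading is ordinary JSON handling; no kernel is involved.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)

cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

The kernel only ever receives the "source" of a code cell when you run it; everything else in the file is the server's business.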



Demonstration of Basic Operations

In [3]:
a="DAB"
In [1]:
print("I love DIL")
I love DIL
In [5]:
print(a)
DAB
  • Look at the "In": it marks the input of the cell
  • The number is the execution count: it shows the order in which the cell was run since the kernel started
  • A cell can be in one of three states:
Appearance   State
[ ]          Cell has not been run yet
[number]     Cell has been executed
[*]          Cell is currently running

Two Modes of Cells

  • Edit Mode: type code or text into a cell. When you see the pencil symbol in the upper right of the browser and the cell border is green, you are in Edit Mode.
  • Command Mode: act on the cells themselves (insert, delete, move, merge), not on their content. The pencil is gone and the cell border is blue in this mode.

How to change the mode efficiently?

  1. Double-click a cell to enter Edit Mode
  2. Press Enter to enter Edit Mode
  3. Press Ctrl+Enter to run the cell and return to Command Mode
  4. Press ESC to switch from Edit Mode to Command Mode

Bringing the Best out of Jupyter Notebooks: More Advanced Operations

Your Jupyter Savior: Shortcuts

  • How to find them: click the keyboard icon on the toolbar ("Open command palette")
  • Shortcut for finding shortcuts: Ctrl + Shift + P or Cmd + Shift + P
  • Personal Favorites (CM+ means: press the key while in Command Mode)
Shortcut          Function
ESC               Go to Command Mode
Ctrl+Enter        Run the Cell
Shift+Enter       Run the Cell and Select the Cell Below
CM+A              Insert a Cell Above
CM+B              Insert a Cell Below
CM+M              Change the Current Cell to a Markdown Cell
CM+Y              Change the Current Cell to a Code Cell
CM+F              Search and Replace
CM+Shift+M        Merge the Selected Cells
Shift+Up/Down     Select Multiple Cells
Tab               Code Auto-completion


Download the Notebook

There are many formats for you to download.

File - Download As - ...

My Personal Favorites, Take 2:

  • Jupyter Notebook - .ipynb
  • Python - .py (Note that the result looks ugly, since all the "In" and "Out" markers are retained as comments)
  • HTML - .html
  • Markdown - .md
  • LaTeX - .tex
  • PDF - .pdf (Note that the appearance of the code changes dramatically, and some lines may be truncated)

Extensions

  1. Install the package:
    pip install jupyter_contrib_nbextensions or conda install -c conda-forge jupyter_contrib_nbextensions
  2. Install the JavaScript and CSS files. In your command line (terminal), type:
    jupyter contrib nbextension install --user
  3. Open Jupyter and select the extensions you need

My Personal Favorites Take 3:

  • Hinterland : Code autocompletion menu for every keypress in a code cell
  • Table of Contents (2): The toc2 extension collects all headings and displays them in a floating window, as a sidebar, or with a navigation menu.
  • Autopep8: Use kernel-specific code to reformat/prettify the contents of code cells
  • highlighter: Highlight selected text in a Markdown cell
  • Snippets: Automatically generate some code templates.

For example, let's try the Snippets

In [ ]:
from __future__ import print_function, division
import numpy as np

Shell Commands

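You can run shell-style commands in Jupyter Notebook too! IPython provides magics such as %ls and %pwd that mirror common commands from Unix-like systems (Linux, Mac OS X); as the output below shows, they also work on Windows.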

In [24]:
%ls
 Volume in drive C is OS
 Volume Serial Number is A8ED-E1F4

 Directory of C:\Users\Jayden x Reich\Desktop\Digital Innovation Lab\silent-sam-twitter\DevBooks\workshop

05/05/2019  06:29 PM    <DIR>          .
05/05/2019  06:29 PM    <DIR>          ..
05/05/2019  04:39 PM    <DIR>          .ipynb_checkpoints
05/05/2019  06:06 PM                14 DIL.txt
05/05/2019  06:29 PM             8,771 Workshop Pt.1.ipynb
               2 File(s)          8,785 bytes
               3 Dir(s)  7,733,256,192 bytes free
In [27]:
%pwd
Out[27]:
'C:\\Users\\Jayden x Reich\\Desktop\\Digital Innovation Lab\\silent-sam-twitter\\DevBooks\\workshop'
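If you need the same information inside ordinary Python code (for example in a .py script, where magics are not available), the standard os module offers rough equivalents. This is only a sketch of one portable alternative, not what the magics do internally.

```python
import os

cwd = os.getcwd()                    # same information as %pwd
entries = sorted(os.listdir(cwd))    # names only, similar to %ls

print(cwd)
for name in entries:
    # Mark directories the way the Windows listing above does.
    kind = "<DIR>" if os.path.isdir(os.path.join(cwd, name)) else "     "
    print(kind, name)
```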

Magic Functions

Magic functions are enhancements that IPython adds on top of normal Python syntax. All of them begin with %.

There are two kinds of magic functions:

  1. Line Magics: begin with a single % and apply only to the line they are on
  2. Cell Magics: begin with %% and apply to the whole cell

List all the magic Functions

In [34]:
%lsmagic
Out[34]:
Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cd  %clear  %cls  %colors  %config  %connect_info  %copy  %ddir  %debug  %dhist  %dirs  %doctest_mode  %echo  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %macro  %magic  %matplotlib  %mkdir  %more  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %ren  %rep  %rerun  %reset  %reset_selective  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%cmd  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%ruby  %%script  %%sh  %%svg  %%sx  %%system  %%time  %%timeit  %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.

Running External Code

In [32]:
%run p.py
2

One of the most commonly used magics: %matplotlib inline

In [36]:
%matplotlib inline

If you want vector images (which do not blur when you zoom in), set:
%config InlineBackend.figure_format = 'svg'

In [37]:
%config InlineBackend.figure_format = 'svg'

Calculate the Time for Running Current Cell

In [48]:
%%time
a=0
for i in range(220000):
    a+=1 
Wall time: 17 ms
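A single %%time measurement can vary a lot from run to run. As a hedged alternative, the standard timeit module repeats the code several times; the sketch below wraps the same loop in a function (the name count_up is made up for this example) and averages five runs.

```python
import timeit


def count_up(n=220000):
    """Same loop as the cell above, wrapped in a function."""
    a = 0
    for _ in range(n):
        a += 1
    return a


# Call the function 5 times and report the average wall time per call.
runs = 5
total = timeit.timeit(count_up, number=runs)
print(f"{total / runs * 1000:.1f} ms per run")
```

Inside a notebook, the %timeit and %%timeit magics listed above do this repetition for you.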

Help!

There are several ways you can get references inside the Jupyter Notebook

  1. Using the ? question mark
In [42]:
?print

Docstring: print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default. Optional keyword arguments: file: a file-like object (stream); defaults to the current sys.stdout. sep: string inserted between values, default a space. end: string appended after the last value, default a newline. flush: whether to forcibly flush the stream. Type: builtin_function_or_method

So, the result would be the docstring of the function.

  2. Use the Help menu at the top

It has references for:

  • Python
  • IPython
  • Markdown
  • Numpy
  • Scipy
  • Matplotlib
  • Pandas ...

Oxford Jupyter Comma

In Jupyter, the value of the last expression in a cell is displayed without calling print. (Of course, that's why it's useful for interactive programming.)

In [51]:
a=6
a
Out[51]:
6

Adding a semicolon to the end of the line suppresses the display of the value of the variable or the statement.

In [53]:
a+6;

Markdown 101 ✒️

What is Markdown

Markdown is a lightweight markup language with plain text formatting syntax. Its design allows it to be converted to many output formats, but the original tool by the same name only supports HTML.

TL;DR
Markdown is for formatting text; it converts to HTML, and inline HTML is also allowed.

ALL THE WORDS IN THIS WORKSHOP DOCUMENT ARE WRITTEN IN MARKDOWN SYNTAX

There are many Markdown rules, so I'm going to pick out my personal favorites. :)

In [58]:
# H1
## H2
### H3
#### H4
##### H5
###### H6

H1

H2

H3

H4

H5
H6

Font Style

Emphasis, aka italics, with asterisks or underscores.

Strong emphasis, aka bold, with asterisks or underscores.

Combined emphasis with asterisks and underscores.

Strikethrough uses two tildes. Scratch this.

Lists

  1. First ordered list item
  2. Another item
  ⋅⋅* Unordered sub-list.
  3. Actual numbers don't matter, just that it's a number
  ⋅⋅1. Ordered sub-list
  4. And another item.
  1. First ordered list item
  2. Another item
    • Unordered sub-list.
  3. Actual numbers don't matter, just that it's a number
    1. Ordered sub-list
  4. And another item.

Images

DIL

DIL

The text inside the [] is the name of the figure, while the URL goes inside the (). The hover title of the figure goes after the URL, separated by a space.

But be careful when you insert images using Markdown: there are pitfalls...

Tables

D        I           L
Digital  Innovation  Lab

The first line is the header.
The second line indicates the alignment.
The values of the table begin at the third line.

Pipes must separate the cells of the table.

Quotes

This section is basically adapted from this site

This section is basically adapted from this site

Line Breaks



There is other syntax as well, such as citations and references, but I personally think it is cumbersome and not very useful for us. If you want to know more about Markdown, please search for relevant material online.

P.S. The comprehension syntax shown under List Comprehension below can also build dictionaries and sets; for a tuple, wrap a generator expression in tuple().

Intermediate Python 🐍

Assuming everyone has a basic grasp of List, Dictionaries, Tuples, Functions, Loops, Conditional Statements, etc.
If you are not sure about these things, please feel free to ask 😊

List Comprehension

Create a list from a sequence based on a condition
Syntax: [<expr> for <item> in <seq> if <cond>]

In [78]:
%%time
t =[]
for i in range(100000):
    if i%2 == 0:
        t.append(i)
Wall time: 1.11 s
In [80]:
%%time
t=[i for i in range(100000) if i%2==0]
Wall time: 18 ms
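To make the syntax concrete, here is a small side-by-side sketch on ten numbers instead of 100000. The variable names are invented for the example; it also shows that <expr> can transform the item rather than just keep it.

```python
# Loop version: build the list of even numbers step by step.
evens_loop = []
for i in range(10):
    if i % 2 == 0:
        evens_loop.append(i)

# Comprehension version: [<expr> for <item> in <seq> if <cond>]
evens_comp = [i for i in range(10) if i % 2 == 0]

# <expr> may be any expression of the item, here its square.
squares_of_evens = [i * i for i in range(10) if i % 2 == 0]

print(evens_loop == evens_comp)
print(squares_of_evens)
```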

Zip

Imagine I have a list a = [1,2,3] and a list b = ["A","B","C"]. I want a new list in which each item pairs the corresponding elements of a and b. What should I do?

In [82]:
a=[1,2,3]
b=["A","B","C"]
c=[]
for itemfroma, itemfromb in zip(a,b):
    print(itemfroma,itemfromb)
    c.append((itemfroma,itemfromb))

print(c)
1 A
2 B
3 C
[(1, 'A'), (2, 'B'), (3, 'C')]
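When you only need the list of pairs, the explicit loop can be skipped. The following sketch shows that shortcut plus two behaviours of zip that are easy to trip over: it stops at the shortest input, and it can be reversed with the * operator.

```python
a = [1, 2, 3]
b = ["A", "B", "C"]

# zip() yields the pairs lazily; list() collects them in one step.
c = list(zip(a, b))
print(c)

# zip stops at the shortest input, so the extra 4 is silently dropped.
shorter = list(zip([1, 2, 3, 4], b))
print(shorter)

# "Unzipping": * passes the pairs back to zip as separate arguments.
numbers, letters = zip(*c)
print(numbers, letters)
```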

Key-value reverse

In [83]:
a={"d":1, "i":2, "l":3}
{value:key for key,value in a.items()}
Out[83]:
{1: 'd', 2: 'i', 3: 'l'}
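One caveat worth knowing: the reversal only works cleanly when all values are distinct. The sketch below uses hypothetical data in which two keys share a value, and shows one possible way to keep both.

```python
# Hypothetical data: "d" and "l" share the value 1.
scores = {"d": 1, "i": 2, "l": 1}

# Plain reversal: a later key overwrites an earlier one with the same value.
reversed_once = {value: key for key, value in scores.items()}
print(reversed_once)

# Keep every key by collecting them in a list per value.
grouped = {}
for key, value in scores.items():
    grouped.setdefault(value, []).append(key)
print(grouped)
```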

Regular Expression

cheatsheet

In [85]:
import re
a = "199987659955"
match = re.search(r'9+',a)
print (match.group())
match = re.search(r'19*',a)
print (match.group())
match = re.search(r'9*',a)
print (match.group())
match = re.search(r'9+.*9+',a)
print (match.group())
999
1999

999876599
In [87]:
import re
match = re.search(r'pi+', 'piiig')
print (match.group())
match = re.search(r'i+', 'piigiiii')
print (match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx1 2 3xx')
print (match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx12 3xx')
print (match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx123xx')
print (match.group())
piii
ii
1 2 3
12 3
123
In [86]:
import re
a = "A765-2781-ZFQ"
match = re.search(r'([AB])([0-9]+)-([0-9]+)-([A-Z0-9]+)',a)
print (match.group())
print (match.group(1))
print (match.group(2))
print (match.group(3))
print (match.group(4))
A765-2781-ZFQ
A
765
2781
ZFQ
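Numbered groups get hard to read as patterns grow. As an optional variation on the last example, the sketch below uses named groups and re.VERBOSE so each part of the pattern can carry a comment. The group names (prefix, first, second, suffix) are made up for this illustration.

```python
import re

# Same pattern as above, spread over several lines with comments.
pattern = re.compile(
    r"""
    (?P<prefix>[AB])       # a single letter, A or B
    (?P<first>[0-9]+)-     # one or more digits, then a hyphen
    (?P<second>[0-9]+)-    # one or more digits, then a hyphen
    (?P<suffix>[A-Z0-9]+)  # one or more capital letters or digits
    """,
    re.VERBOSE,
)

match = pattern.search("A765-2781-ZFQ")
if match:  # re.search returns None when nothing matches
    parts = match.groupdict()
    print(parts)

no_match = pattern.search("C1-2-3")  # contains neither A nor B
print(no_match)
```

Checking for None before calling .group() avoids an AttributeError on input that does not match.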

More Complex Data Structures and Intermediate Hacking Operations That I Really Want to Teach But I don't Have Time to Do So

  • Counter
  • Default Dictionary
  • Set
  • Enumerate
  • Item Getter
  • Map, Filter, Reduce
  • Lambda
  • Ternary Operators

But we will see some of them in the following content 😉
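For reference, here is a one-line taste of each item on the list above. It is a quick sketch on toy data, not a full treatment, and every variable name is invented for the example.

```python
from collections import Counter, defaultdict
from functools import reduce
from operator import itemgetter

words = ["d", "i", "l", "d", "i", "d"]  # toy data for illustration

counts = Counter(words)                  # Counter: tally how often each item occurs
top = counts.most_common(1)

positions = defaultdict(list)            # defaultdict: missing keys start as []
for index, word in enumerate(words):     # enumerate: position and item together
    positions[word].append(index)

unique = set(words)                      # set: unique items, no order

# itemgetter(1) picks the count out of each (word, count) pair.
pairs = sorted(counts.items(), key=itemgetter(1))

doubled = list(map(lambda x: x * 2, [1, 2, 3]))        # map + lambda
odds = list(filter(lambda x: x % 2 == 1, [1, 2, 3]))   # filter
total = reduce(lambda x, y: x + y, [1, 2, 3])          # reduce

label = "many" if counts["d"] > 2 else "few"           # ternary operator

print(top, dict(positions), pairs, doubled, odds, total, label)
```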

Practical-Oriented Data Wrangling and Visualization 🐼

That is, everything we learn here is for practical use rather than a thorough understanding of the underlying mechanisms. But first of all:
What are Pandas and NumPy?

NumPy is a powerful Python library that expands Python's functionality by allowing users to create multi-dimensional array objects (ndarray). In addition to the creation of ndarray objects, NumPy provides a large set of mathematical functions that operate quickly on the entries of an ndarray without the need for for loops.

The pandas (PANel + DAta) Python library provides easy and fast data analysis and manipulation tools through numerical tables and time series data structures called DataFrame and Series, respectively. Pandas was created to do the following: provide data structures that can handle both time series and non-time series data; allow mathematical operations on the data structures, ignoring their metadata; support relational operations like those found in SQL (join, group by, etc.); and handle missing data.

TL;DR

  • NumPy is about arrays, the advanced lists
  • Pandas is about tables, like Excel in Python
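A tiny sketch of both ideas, using invented toy data: NumPy applies arithmetic to a whole array at once, and pandas filters a table by a condition on one of its columns.

```python
import numpy as np
import pandas as pd

# NumPy: arithmetic applies to every element at once, no for loop needed.
values = np.array([1, 2, 3, 4])
doubled = values * 2
mean_value = values.mean()

# pandas: a DataFrame is a table; each column is a Series.
table = pd.DataFrame({"name": ["d", "i", "l"], "score": [1, 2, 3]})  # toy data
high = table[table["score"] > 1]  # keep only the rows whose score is above 1

print(doubled)
print(mean_value)
print(high)
```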

Setup

In [1]:
# These abbreviations are the conventional aliases
import numpy as np
import pandas as pd

Read Files

pandas supports most common file types, as long as the data is formatted in a tabular way

  • pd.read_csv()
  • pd.read_excel()
  • pd.read_table()
  • pd.read_json()

Let's have a look at our dataset. Remember the shell commands we mentioned earlier?

In [3]:
pd.read_csv("Static Tweets with Norm Loc.csv", index_col="Unnamed: 0").head(3)
c:\python36\lib\site-packages\IPython\core\interactiveshell.py:2785: DtypeWarning: Columns (0,22,41) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[3]:
id_str from_user text time geo_coordinates user_lang in_reply_to_screen_name from_user_id_str in_reply_to_status_id_str source ... yearmon month trans_sour pre no addr lat long point_x Accurarcy
0 1.099692e+18 miriammarkfield RT @jordangreentcb: I filed my stories about t... 2019-02-24 15:24:51 NaN en NaN 1.022726e+08 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2019-02 2.0 Twitter Web Client North Carolina North Carolina North Carolina, USA 35.672964 -79.039292 35 40m 22.67s N, 79 2m 21.4508s W True
1 1.099563e+18 1st_Reduce_Harm RT @jordangreentcb: #silentsam https://t.co/55... 2019-02-24 06:53:50 NaN en NaN 9.645217e+17 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android Not "the Midwest", THE NORTH. Not "the Midwest", THE NORTH. Midwest, Natrona County, Wyoming, USA 43.411391 -106.280075 43 24m 41.0076s N, 106 16m 48.27s W False
2 1.099726e+18 SilentSamIAm RT @jordangreentcb: Antiracists tell neo-Confe... 2019-02-24 17:42:29 NaN en NaN 9.137753e+17 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN

3 rows × 41 columns

In [2]:
db=pd.read_csv("Static Tweets with Norm Loc.csv", index_col="Unnamed: 0")
user=pd.read_csv("Users with Nor Loc.csv", index_col="Unnamed: 0")
c:\python36\lib\site-packages\IPython\core\interactiveshell.py:2785: DtypeWarning: Columns (0,22,41) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
c:\python36\lib\site-packages\IPython\core\interactiveshell.py:2785: DtypeWarning: Columns (0,3) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
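The DtypeWarning above suggests specifying the dtype option. Here is a minimal sketch of that idea on a two-row stand-in for the real file (the values are invented). It also illustrates why reading an ID column such as id_str as float64, as in the output above, can be risky: a float cannot hold all the digits of a 19-digit ID.

```python
import io

import pandas as pd

# A tiny stand-in for the real CSV; the values are invented for illustration.
csv_text = (
    "id_str,from_user\n"
    "1099692000000000123,user_a\n"
    "1099563000000000456,user_b\n"
)

# Reading the ID column as text keeps every digit and avoids type guessing.
tweets = pd.read_csv(io.StringIO(csv_text), dtype={"id_str": str})
ids = tweets["id_str"].tolist()
print(ids)

# The same ID stored as a float gets rounded to a nearby representable value.
as_float = float("1099692000000000123")
print(int(as_float) == 1099692000000000123)
```

Whether text is the right type for a given column depends on what you plan to do with it; for identifiers that are never used in arithmetic, it is usually the safer choice.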

Data Overview

.info() shows you the structure of the dataframe, including the index, the columns, and their data types. It is very common to call it every time you begin processing a dataset.

In [109]:
db.info()
<class 'pandas.core.frame.DataFrame'>
Index: 59218 entries, 0 to 59253
Data columns (total 41 columns):
id_str                       59217 non-null float64
from_user                    59217 non-null object
text                         59217 non-null object
time                         59216 non-null object
geo_coordinates              14 non-null object
user_lang                    59216 non-null object
in_reply_to_screen_name      2916 non-null object
from_user_id_str             59216 non-null float64
in_reply_to_status_id_str    2490 non-null float64
source                       59216 non-null object
user_followers_count         59183 non-null float64
user_friends_count           59115 non-null float64
user_location                45261 non-null object
entities_str                 59216 non-null object
place                        908 non-null object
retweet_count                51963 non-null float64
favorite_count               9274 non-null float64
user_description             54458 non-null object
user_created_at              59216 non-null object
user_geo_enabled             25624 non-null object
user_listed_count            52910 non-null float64
user_verified                59216 non-null object
user_statuses_count          59216 non-null float64
user_screen_name             59216 non-null object
user_favourites_count        59074 non-null float64
possibly_sensitive           232 non-null object
lang_trans                   59216 non-null object
date                         59216 non-null object
outlier_is                   59216 non-null float64
outlier_dbs                  59216 non-null float64
year                         59216 non-null float64
yearmon                      59216 non-null object
month                        59216 non-null float64
trans_sour                   59216 non-null object
pre                          38247 non-null object
no                           38246 non-null object
addr                         38240 non-null object
lat                          38240 non-null float64
long                         38240 non-null float64
point_x                      38240 non-null object
Accurarcy                    752 non-null object
dtypes: float64(16), object(25)
memory usage: 19.0+ MB

.describe() summarizes the distribution of the numeric variables. It shows the count, the mean, the standard deviation, the minimum and maximum, and the quartiles.

In [110]:
user.describe()
Out[110]:
tweets_num from_user_id_str in_reply_to_status_id_str user_followers_count user_friends_count retweet_count favorite_count user_listed_count user_statuses_count user_favourites_count outlier_is outlier_dbs isoutlier_if created_days lat long
count 26098.000000 2.609700e+04 4.500000e+02 2.607700e+04 26047.000000 23660.000000 1977.000000 22346.000000 2.609700e+04 26033.000000 26097.000000 26097.000000 26097.000000 26097.000000 16034.000000 16034.000000
mean 2.265938 2.016850e+17 1.029731e+18 3.538700e+03 2168.344531 542.000296 15.194740 71.849593 4.443468e+04 37552.580955 0.824194 143.518182 0.799977 2247.721577 36.483251 -73.527622
std 12.392065 3.709054e+17 5.870383e+16 2.910858e+04 6297.468594 776.933658 151.958553 339.113175 7.759400e+04 63368.871498 0.566318 113.768807 0.600042 1143.276316 12.066288 46.359005
min -79.562596 1.737000e+03 6.960835e+17 1.000000e+00 1.000000 1.000000 1.000000 1.000000 1.000000e+00 1.000000 -1.000000 -1.000000 -1.000000 30.000000 -79.406307 -170.695975
25% 1.000000 1.227112e+08 1.031731e+18 2.050000e+02 336.000000 22.000000 1.000000 4.000000 4.473000e+03 3320.000000 1.000000 24.000000 1.000000 1215.000000 35.227087 -95.367697
50% 1.000000 7.547373e+08 1.032009e+18 5.940000e+02 793.000000 162.000000 2.000000 13.000000 1.610100e+04 13803.000000 1.000000 144.000000 1.000000 2396.000000 36.852984 -79.763550
75% 1.000000 4.166714e+09 1.069622e+18 1.916000e+03 2055.000000 685.000000 6.000000 50.000000 5.012800e+04 43905.000000 1.000000 243.000000 1.000000 3283.000000 40.730862 -76.938207
max 1271.000000 1.093451e+18 1.100848e+18 2.149919e+06 453537.000000 2566.000000 4953.000000 17022.000000 2.144185e+06 969558.000000 1.000000 341.000000 1.000000 4618.000000 70.399627 175.368412
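On a table this wide the summary is hard to read, so here is the same call on a four-row toy frame (invented data). It shows that, by default, only the numeric column is summarized and the text column is left out.

```python
import pandas as pd

# Toy data: one numeric column and one text column.
toy = pd.DataFrame({"followers": [10, 20, 30, 40], "name": ["a", "b", "c", "d"]})

summary = toy.describe()  # the result is itself a DataFrame
print(summary)

# Individual statistics can be looked up by row label and column name.
mean_followers = summary.loc["mean", "followers"]
print(mean_followers)
```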

Data Indexing and Selection

Select Rows

Rows are accessed by their index
Index labels are the identifiers of the rows
Usually, they are just numbers

List all the index labels in the dataframe

In [108]:
db.index
Out[108]:
Index([  '0',   '1',   '2',   '3',   '4',   '5',   '6',   '7',   '8',   '9',
       ...
       59244, 59245, 59246, 59247, 59248, 59249, 59250, 59251, 59252, 59253],
      dtype='object', length=59218)

If you want to select a row, use the .loc indexer.

In [113]:
db.loc["1"]
Out[113]:
id_str                                                             1.09956e+18
from_user                                                      1st_Reduce_Harm
text                         RT @jordangreentcb: #silentsam https://t.co/55...
time                                                       2019-02-24 06:53:50
geo_coordinates                                                            NaN
user_lang                                                                   en
in_reply_to_screen_name                                                    NaN
from_user_id_str                                                   9.64522e+17
in_reply_to_status_id_str                                                  NaN
source                       <a href="http://twitter.com/download/android" ...
user_followers_count                                                        12
user_friends_count                                                          98
user_location                                    Not "the Midwest", THE NORTH.
entities_str                 {"hashtags":[{"text":"silentsam","indices":[20...
place                                                                      NaN
retweet_count                                                                3
favorite_count                                                             NaN
user_description                                    Evidence based everything.
user_created_at                                 Fri Feb 16 15:27:57 +0000 2018
user_geo_enabled                                                           NaN
user_listed_count                                                          NaN
user_verified                                                            False
user_statuses_count                                                        818
user_screen_name                                               1st_Reduce_Harm
user_favourites_count                                                     1947
possibly_sensitive                                                         NaN
lang_trans                                                             English
date                                                       2019-02-24 00:00:00
outlier_is                                                                   1
outlier_dbs                                                                  1
year                                                                      2019
yearmon                                                                2019-02
month                                                                        2
trans_sour                                                 Twitter for Android
pre                                              Not "the Midwest", THE NORTH.
no                                               Not "the Midwest", THE NORTH.
addr                                     Midwest, Natrona County, Wyoming, USA
lat                                                                    43.4114
long                                                                   -106.28
point_x                                    43 24m 41.0076s N, 106 16m 48.27s W
Accurarcy                                                                False
Name: 1, dtype: object

The loc attribute allows indexing and slicing that always reference the explicit index. So you have to make sure the label you ask for is actually in the index. For example, the first labels in our index are strings such as '0' and '1', not integers, so the following command raises an error.

In [114]:
db.loc[1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
c:\python36\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
   1789                 if not ax.contains(key):
-> 1790                     error()
   1791             except TypeError as e:

c:\python36\lib\site-packages\pandas\core\indexing.py in error()
   1784                                .format(key=key,
-> 1785                                        axis=self.obj._get_axis_name(axis)))
   1786 

KeyError: 'the label [1] is not in the [index]'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-114-01cc2b216229> in <module>()
----> 1 db.loc[1]

c:\python36\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1476 
   1477             maybe_callable = com._apply_if_callable(key, self.obj)
-> 1478             return self._getitem_axis(maybe_callable, axis=axis)
   1479 
   1480     def _is_scalar_access(self, key):

c:\python36\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1909 
   1910         # fall thru to straight lookup
-> 1911         self._validate_key(key, axis)
   1912         return self._get_label(key, axis=axis)
   1913 

c:\python36\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
   1796                 raise
   1797             except:
-> 1798                 error()
   1799 
   1800     def _is_scalar_access(self, key):

c:\python36\lib\site-packages\pandas\core\indexing.py in error()
   1783                 raise KeyError(u"the label [{key}] is not in the [{axis}]"
   1784                                .format(key=key,
-> 1785                                        axis=self.obj._get_axis_name(axis)))
   1786 
   1787             try:

KeyError: 'the label [1] is not in the [index]'
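The same mismatch can be reproduced on a three-row toy frame (invented data), which makes the cause easier to see than the long traceback. The last step shows one possible fix, under the assumption that every label is a whole number written as text.

```python
import pandas as pd

# Toy frame whose index labels are strings, like the first labels of db.index.
frame = pd.DataFrame(
    {"from_user": ["user_a", "user_b", "user_c"]},
    index=["0", "1", "2"],
)

before = frame.loc["1", "from_user"]  # the string label "1" exists
print(before)

raised = False
try:
    frame.loc[1]                      # the integer 1 is not a label here
except KeyError as error:
    raised = True
    print("KeyError:", error)

# One possible fix: convert the labels to integers before looking rows up.
frame.index = frame.index.astype(int)
after = frame.loc[1, "from_user"]
print(after)
```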

Selecting rows works much like slicing a NumPy array, so you can slice the index in many ways. Note that a .loc label slice includes both endpoints, as the three rows returned below show.

In [115]:
#select row 2 to row 4
db.loc["2":"4"]
Out[115]:
id_str from_user text time geo_coordinates user_lang in_reply_to_screen_name from_user_id_str in_reply_to_status_id_str source ... yearmon month trans_sour pre no addr lat long point_x Accurarcy
2 1.099726e+18 SilentSamIAm RT @jordangreentcb: Antiracists tell neo-Confe... 2019-02-24 17:42:29 NaN en NaN 9.137753e+17 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
3 1.099630e+18 IGD_News RT @adaure: Drowned out by the chants of “go h... 2019-02-24 11:18:06 NaN en NaN 3.289440e+09 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android NaN NaN NaN NaN NaN NaN NaN
4 1.099634e+18 tartnyc RT @jordangreentcb: Antiracists tell neo-Confe... 2019-02-24 11:35:24 NaN en NaN 1.596175e+08 NaN <a href="https://mobile.twitter.com" rel="nofo... ... 2019-02 2.0 Twitter Web App NaN NaN NaN NaN NaN NaN NaN

3 rows × 41 columns

In [118]:
#select multiple rows at a time
db.loc[["6","2","8"]]
Out[118]:
id_str from_user text time geo_coordinates user_lang in_reply_to_screen_name from_user_id_str in_reply_to_status_id_str source ... yearmon month trans_sour pre no addr lat long point_x Accurarcy
6 1.099965e+18 bluemazatl RT @jordangreentcb: The coalition of neo-Confe... 2019-02-25 09:30:09 NaN en NaN 2.365186e+08 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android Valsetz, Oregon Valsetz, Oregon Valsetz, Polk County, Oregon, USA 44.836235 -123.651337 44 50m 10.4464s N, 123 39m 4.81248s W True
2 1.099726e+18 SilentSamIAm RT @jordangreentcb: Antiracists tell neo-Confe... 2019-02-24 17:42:29 NaN en NaN 9.137753e+17 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
8 1.100041e+18 brklwyr In awe of the great print/online work this yea... 2019-02-25 14:33:48 NaN en NaN 3.304751e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN

3 rows × 41 columns

In summary, the selector inside .loc[] is typically one of these three things:

  • A single label, like "2"
  • A list of labels, like ['2','8','6']
  • A slice of labels, like 'a':'c'
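The three forms, sketched on a small toy frame with letter labels (invented data) so that labels cannot be confused with positions:

```python
import pandas as pd

frame = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])  # toy data

single = frame.loc["b"]          # a single label: one row, returned as a Series
several = frame.loc[["d", "a"]]  # a list of labels: rows in the order given
span = frame.loc["a":"c"]        # a label slice: includes both "a" and "c"

print(single["value"])
print(several.index.tolist())
print(span.index.tolist())
```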

Besides explicit labels, we can also use the positional index to access a row
In this scenario, we use the .iloc indexer

In [122]:
db.iloc[2]
Out[122]:
id_str                                                             1.09973e+18
from_user                                                         SilentSamIAm
text                         RT @jordangreentcb: Antiracists tell neo-Confe...
time                                                       2019-02-24 17:42:29
geo_coordinates                                                            NaN
user_lang                                                                   en
in_reply_to_screen_name                                                    NaN
from_user_id_str                                                   9.13775e+17
in_reply_to_status_id_str                                                  NaN
source                       <a href="http://twitter.com/download/iphone" r...
user_followers_count                                                       373
user_friends_count                                                         146
user_location                                                              NaN
entities_str                 {"hashtags":[{"text":"SilentSam","indices":[91...
place                                                                      NaN
retweet_count                                                               28
favorite_count                                                             NaN
user_description                                       https://t.co/xo9rEtNhC4
user_created_at                                 Fri Sep 29 14:39:39 +0000 2017
user_geo_enabled                                                           NaN
user_listed_count                                                            2
user_verified                                                            False
user_statuses_count                                                       2665
user_screen_name                                                  SilentSamIAm
user_favourites_count                                                     2585
possibly_sensitive                                                         NaN
lang_trans                                                             English
date                                                       2019-02-24 00:00:00
outlier_is                                                                   1
outlier_dbs                                                                  2
year                                                                      2019
yearmon                                                                2019-02
month                                                                        2
trans_sour                                                  Twitter for iPhone
pre                                                                        NaN
no                                                                         NaN
addr                                                                       NaN
lat                                                                        NaN
long                                                                       NaN
point_x                                                                    NaN
Accurarcy                                                                  NaN
Name: 2, dtype: object

In this situation, the selector inside .iloc[] can only be one of the following:

  • A number
  • A list of numbers
  • A slice object
  • A Boolean list
  • A callable function

We will talk about the last two later

In [124]:
db.iloc[2:4]
Out[124]:
id_str from_user text time geo_coordinates user_lang in_reply_to_screen_name from_user_id_str in_reply_to_status_id_str source ... yearmon month trans_sour pre no addr lat long point_x Accurarcy
2 1.099726e+18 SilentSamIAm RT @jordangreentcb: Antiracists tell neo-Confe... 2019-02-24 17:42:29 NaN en NaN 9.137753e+17 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
3 1.099630e+18 IGD_News RT @adaure: Drowned out by the chants of “go h... 2019-02-24 11:18:06 NaN en NaN 3.289440e+09 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android NaN NaN NaN NaN NaN NaN NaN

2 rows × 41 columns

In [125]:
db.iloc[[2,8,6]]
Out[125]:
id_str from_user text time geo_coordinates user_lang in_reply_to_screen_name from_user_id_str in_reply_to_status_id_str source ... yearmon month trans_sour pre no addr lat long point_x Accurarcy
2 1.099726e+18 SilentSamIAm RT @jordangreentcb: Antiracists tell neo-Confe... 2019-02-24 17:42:29 NaN en NaN 9.137753e+17 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
8 1.100041e+18 brklwyr In awe of the great print/online work this yea... 2019-02-25 14:33:48 NaN en NaN 3.304751e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
6 1.099965e+18 bluemazatl RT @jordangreentcb: The coalition of neo-Confe... 2019-02-25 09:30:09 NaN en NaN 2.365186e+08 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android Valsetz, Oregon Valsetz, Oregon Valsetz, Polk County, Oregon, USA 44.836235 -123.651337 44 50m 10.4464s N, 123 39m 4.81248s W True

3 rows × 41 columns

Select Columns

Each column in a pandas DataFrame is a Series. Here are some common ways to access a Series:

  • Using dot
In [126]:
db.from_user
Out[126]:
0        miriammarkfield
1        1st_Reduce_Harm
2           SilentSamIAm
3               IGD_News
4                tartnyc
5             DT_Sensual
6             bluemazatl
7             KendraElWa
8                brklwyr
9        cujowasagoodboy
10         M_Abdirizak93
11       talkingattheTV2
12       margaretcmaurer
13          takeactionch
14            dhosterman
15            LocoCravey
16              226press
17       constantnatalie
18               tmorman
19              amymorto
20         acthistreview
21       Dj_Sepultourist
22            tarheelesq
23       angry_barbaloot
24         condorscondor
25            chogmorson
26        AndreFastayrol
27       takethemdownnow
28            magus721rn
29           RichmondDoc
              ...       
59224    knightstivender
59225           Big_G_09
59226        CampusY_UNC
59227            CPJ_UNC
59228          laurenps_
59229    Robert_Smalls62
59230       WilsonforSBP
59231        _thepopshop
59232       WilsonforSBP
59233           Big_G_09
59234      calgaryhester
59235         cdwyer0213
59236               tndp
59237            Avilyst
59238         saveusalll
59239         UNChistory
59240       Schmocki_Boi
59241       Surreyissafe
59242            ljhhrpr
59243    TarHeelAltRight
59244            iah_unc
59245     TriangleEditor
59246           AshHeffe
59247    TarHeelAltRight
59248      niiikkkiii_mn
59249       WilsonforSBP
59250           viejas46
59251           LokoVybe
59252           haley_nm
59253         LocoCravey
Name: from_user, Length: 59218, dtype: object
  • Using square brackets
In [127]:
db['from_user']
Out[127]:
0        miriammarkfield
1        1st_Reduce_Harm
2           SilentSamIAm
3               IGD_News
4                tartnyc
5             DT_Sensual
6             bluemazatl
7             KendraElWa
8                brklwyr
9        cujowasagoodboy
10         M_Abdirizak93
11       talkingattheTV2
12       margaretcmaurer
13          takeactionch
14            dhosterman
15            LocoCravey
16              226press
17       constantnatalie
18               tmorman
19              amymorto
20         acthistreview
21       Dj_Sepultourist
22            tarheelesq
23       angry_barbaloot
24         condorscondor
25            chogmorson
26        AndreFastayrol
27       takethemdownnow
28            magus721rn
29           RichmondDoc
              ...       
59224    knightstivender
59225           Big_G_09
59226        CampusY_UNC
59227            CPJ_UNC
59228          laurenps_
59229    Robert_Smalls62
59230       WilsonforSBP
59231        _thepopshop
59232       WilsonforSBP
59233           Big_G_09
59234      calgaryhester
59235         cdwyer0213
59236               tndp
59237            Avilyst
59238         saveusalll
59239         UNChistory
59240       Schmocki_Boi
59241       Surreyissafe
59242            ljhhrpr
59243    TarHeelAltRight
59244            iah_unc
59245     TriangleEditor
59246           AshHeffe
59247    TarHeelAltRight
59248      niiikkkiii_mn
59249       WilsonforSBP
59250           viejas46
59251           LokoVybe
59252           haley_nm
59253         LocoCravey
Name: from_user, Length: 59218, dtype: object
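
The two notations above return the same Series, but they are not always interchangeable. The sketch below, on a made-up DataFrame named toy (hypothetical column names and values), shows two cases where only square brackets work: a column name containing a space, and a column name that clashes with a DataFrame method.

```python
import pandas as pd

# Hypothetical toy data; "user lang" and "count" are chosen to show the pitfalls.
toy = pd.DataFrame(
    {"from_user": ["ann", "bob"], "user lang": ["en", "fr"], "count": [3, 7]}
)

same = toy.from_user.equals(toy["from_user"])  # dot and brackets give the same Series
lang = toy["user lang"]                        # a name with a space needs brackets
counts = toy["count"]                          # brackets always return the column
is_method = callable(toy.count)                # toy.count is the DataFrame.count method
```

For this reason, square brackets are the safer default when column names are not known in advance.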

Select Data Cell and More...

You can select rows and columns at the same time. One way is to use loc (which selects by label) or iloc (which selects by position).

In [130]:
db.loc['2':'6',['from_user','time']]
Out[130]:
      from_user                 time
2  SilentSamIAm  2019-02-24 17:42:29
3      IGD_News  2019-02-24 11:18:06
4       tartnyc  2019-02-24 11:35:24
5    DT_Sensual  2019-02-28 03:28:14
6    bluemazatl  2019-02-25 09:30:09
In [134]:
db.iloc[[6,9,10],2:5]
Out[134]:
                                                 text                 time  geo_coordinates
6   RT @jordangreentcb: The coalition of neo-Confe...  2019-02-25 09:30:09              NaN
9   RT @adaure: Drowned out by the chants of “go h...  2019-02-24 11:19:38              NaN
10  RT @jordangreentcb: #silentsam https://t.co/55...  2019-02-24 11:43:40              NaN
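
One difference between the two cells above is easy to miss: a loc slice includes its end label, while an iloc slice excludes its end position. The sketch below illustrates this on a made-up DataFrame named toy (hypothetical values; its index holds integer labels, so the loc slice is written with integers).

```python
import pandas as pd

# Hypothetical toy data with index labels 2..6, loosely mirroring the rows above.
toy = pd.DataFrame(
    {"from_user": ["u2", "u3", "u4", "u5", "u6"],
     "text": ["t2", "t3", "t4", "t5", "t6"],
     "time": ["d2", "d3", "d4", "d5", "d6"]},
    index=[2, 3, 4, 5, 6],
)

by_label = toy.loc[2:4, ["from_user", "time"]]  # labels 2, 3, 4: the end label is included
by_position = toy.iloc[0:2, 0:2]                # positions 0 and 1: the end is excluded
```

Here by_label has three rows, while by_position has two rows and the first two columns.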

You could also do it step by step: select a column and then pinpoint the rows by index, or vice versa.

In [138]:
db['text']['2':'6']
Out[138]:
2    RT @jordangreentcb: Antiracists tell neo-Confe...
3    RT @adaure: Drowned out by the chants of “go h...
4    RT @jordangreentcb: Antiracists tell neo-Confe...
5    People that seek separation without tolerance ...
6    RT @jordangreentcb: The coalition of neo-Confe...
Name: text, dtype: object
In [140]:
db.loc['2':'6']['text']
Out[140]:
2    RT @jordangreentcb: Antiracists tell neo-Confe...
3    RT @adaure: Drowned out by the chants of “go h...
4    RT @jordangreentcb: Antiracists tell neo-Confe...
5    People that seek separation without tolerance ...
6    RT @jordangreentcb: The coalition of neo-Confe...
Name: text, dtype: object
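
As a quick check that the orders are equivalent for reading data, the sketch below compares both step-by-step forms with a single loc call on a made-up DataFrame named toy (hypothetical values, integer index labels). When assigning values rather than reading them, the single-call form is generally the safer choice, since chained selection may operate on a temporary copy.

```python
import pandas as pd

# Hypothetical toy data with integer index labels.
toy = pd.DataFrame(
    {"text": ["a", "b", "c", "d"], "time": [1, 2, 3, 4]},
    index=[2, 3, 4, 5],
)

column_then_rows = toy["text"].loc[3:4]  # pick the column, then the rows
rows_then_column = toy.loc[3:4]["text"]  # pick the rows, then the column
one_step = toy.loc[3:4, "text"]          # rows and column in a single call
```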

Now, let's move on to conditional selection.

For example, we want the records with more than 1000 retweets.

In [144]:
db.retweet_count>1000
Out[144]:
0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
59224    False
59225    False
59226    False
59227    False
59228    False
59229    False
59230    False
59231    False
59232    False
59233    False
59234    False
59235    False
59236    False
59237    False
59238    False
59239    False
59240    False
59241    False
59242    False
59243    False
59244    False
59245    False
59246    False
59247    False
59248    False
59249    False
59250    False
59251    False
59252    False
59253    False
Name: retweet_count, Length: 59218, dtype: bool

Now we have a Series of Boolean values. Next, we just need to pass it to loc as the row selector.

In [156]:
db.loc[db.retweet_count>1000,:]
Out[156]:
id_str from_user text time geo_coordinates user_lang in_reply_to_screen_name from_user_id_str in_reply_to_status_id_str source ... yearmon month trans_sour pre no addr lat long point_x Accurarcy
177 1.099048e+18 AkeemElmin RT @RaleighReporter: Probably isn't surprising... 2019-02-22 20:45:39 NaN en NaN 1.598616e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Eindhoven, The Netherlands Eindhoven, The Netherlands Eindhoven, Noord-Brabant, Nederland 51.439265 5.478633 51 26m 21.3533s N, 5 28m 43.0788s E True
249 1.097118e+18 28Haitiankids RT @RaleighReporter: Probably isn't surprising... 2019-02-17 12:57:02 NaN en NaN 1.952861e+09 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android Casselberry, Fl Casselberry, Fl Casselberry, Seminole County, Florida, 32707, USA 28.654276 -81.323791 28 39m 15.3936s N, 81 19m 25.6467s W True
250 1.097220e+18 chasejefferson_ RT @RaleighReporter: Probably isn't surprising... 2019-02-17 19:44:29 NaN en NaN 3.352377e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Memphis | DC Memphis | DC Compa???_a California s.a. - MEMPHIS, Calle 17... 4.619598 -74.093892 4 37m 10.5535s N, 74 5m 38.0123s W False
309 1.095656e+18 Prof_Kennedy RT @RaleighReporter: Probably isn't surprising... 2019-02-13 12:10:27 NaN en NaN 2.479750e+08 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2019-02 2.0 Twitter Web Client Rhode Island Rhode Island Rhode Island, USA 41.796241 -71.599237 41 47m 46.4672s N, 71 35m 57.2539s W True
340 1.095102e+18 Misskatengo RT @RaleighReporter: Probably isn't surprising... 2019-02-11 23:28:04 NaN en NaN 5.232052e+08 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2019-02 2.0 Twitter Web Client My lane My lane My Lane, Falls Township, Bucks County, Pennsyl... 40.193663 -74.797663 40 11m 37.1868s N, 74 47m 51.5868s W False
345 1.095146e+18 KellyRek RT @RaleighReporter: Probably isn't surprising... 2019-02-12 02:24:03 NaN en NaN 3.095588e+08 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android Arizona Arizona Arizona, USA 34.395342 -111.763276 34 23m 43.2312s N, 111 45m 47.7918s W True
351 1.094351e+18 antoniodivine RT @RaleighReporter: Probably isn't surprising... 2019-02-09 21:41:35 NaN en NaN 3.185836e+07 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android Miami, FL, USA Miami, FL, USA Miami, Miami-Dade County, Florida, USA 25.774266 -80.193659 25 46m 27.3569s N, 80 11m 37.172s W True
353 1.094788e+18 _Claytron RT @RaleighReporter: Probably isn't surprising... 2019-02-11 02:39:35 NaN en NaN 3.160624e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Greater Israel Greater Israel Israel, Tettedze Street, Christian Village, Ga... 5.639444 -0.232406 5 38m 21.9971s N, 0 13m 56.6613s W False
357 1.095157e+18 wrongestwrong RT @RaleighReporter: Probably isn't surprising... 2019-02-12 03:04:36 NaN en NaN 9.167096e+07 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
360 1.094821e+18 trapp_21 RT @RaleighReporter: Probably isn't surprising... 2019-02-11 04:48:57 NaN en NaN 4.197632e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Charlotte, NC Charlotte, NC Charlotte, Mecklenburg County, North Carolina,... 35.227087 -80.843127 35 13m 37.5128s N, 80 50m 35.2565s W True
361 1.095104e+18 Rashidbelike RT @RaleighReporter: Probably isn't surprising... 2019-02-11 23:35:40 NaN en NaN 7.078800e+17 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
368 1.095148e+18 ZaRdOz420WPN RT @RaleighReporter: Probably isn't surprising... 2019-02-12 02:31:11 NaN en NaN 7.281802e+08 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android Appalachia Appalachia Appalachia, Wise County, Virginia, 24216, USA 36.906763 -82.781828 36 54m 24.3472s N, 82 46m 54.579s W True
378 1.095437e+18 SYMONERAIDER RT @RaleighReporter: Probably isn't surprising... 2019-02-12 21:39:22 NaN en NaN 8.816903e+17 NaN <a href="http://twitter.com/#!/download/ipad" ... ... 2019-02 2.0 Twitter for iPad United States USA USA 39.783730 -100.445882 39 47m 1.42944s N, 100 26m 45.177s W True
380 1.094347e+18 knight_dlx RT @RaleighReporter: Probably isn't surprising... 2019-02-09 21:25:41 NaN en NaN 8.292396e+07 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android online online Online, Croix d'Argent, Montpellier, H??rault,... 43.590472 3.859513 43 35m 25.6987s N, 3 51m 34.2476s E False
381 1.094306e+18 mfdaan RT @RaleighReporter: Probably isn't surprising... 2019-02-09 18:42:26 NaN en NaN 3.511230e+07 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
393 1.094487e+18 deuceohsixx RT @RaleighReporter: Probably isn't surprising... 2019-02-10 06:42:34 NaN en NaN 3.902784e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Belltown, Seattle, Washington Belltown, Seattle, Washington Belltown, Seattle, King County, Washington, 98... 47.613231 -122.345361 47 36m 47.632s N, 122 20m 43.2985s W True
395 1.094784e+18 WayneLev1 RT @RaleighReporter: Probably isn't surprising... 2019-02-11 02:22:56 NaN en NaN 7.259834e+08 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android South Florida South Florida South Florida, Columbus, Cherokee County, Kans... 37.167713 -94.846426 37 10m 3.76608s N, 94 50m 47.1322s W False
402 1.093862e+18 21ponky RT @RaleighReporter: Probably isn't surprising... 2019-02-08 13:21:36 NaN en NaN 4.621058e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone miami miami Miami, Miami-Dade County, Florida, USA 25.774266 -80.193659 25 46m 27.3569s N, 80 11m 37.172s W True
403 1.093865e+18 emulatelife RT @RaleighReporter: Probably isn't surprising... 2019-02-08 13:29:56 NaN en NaN 2.615801e+08 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2019-02 2.0 Twitter Web Client Atlanta Atlanta Atlanta, Fulton County, Georgia, USA 33.749099 -84.390185 33 44m 56.7553s N, 84 23m 24.6656s W True
404 1.093974e+18 mighty_bee_ RT @RaleighReporter: Probably isn't surprising... 2019-02-08 20:45:41 NaN en NaN 3.684749e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone heaven only knows heaven only knows Heaven Only Knows Fashioins, 2630, Bourquin Cr... 49.049827 -122.309137 49 2m 59.3765s N, 122 18m 32.8939s W False
405 1.093922e+18 yarieliz03 RT @RaleighReporter: Probably isn't surprising... 2019-02-08 17:17:01 NaN en NaN 1.393197e+09 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android NaN NaN NaN NaN NaN NaN NaN
407 1.093928e+18 BarcoLemuel RT @RaleighReporter: Probably isn't surprising... 2019-02-08 17:43:55 NaN en NaN 1.040304e+18 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Preston Hollow, Dallas Preston Hollow, Dallas Preston Hollow United Methodist, 6315, Walnut ... 32.880687 -96.797175 32 52m 50.473s N, 96 47m 49.8306s W True
408 1.093966e+18 SimphiweFigo RT @RaleighReporter: Probably isn't surprising... 2019-02-08 20:12:33 NaN en NaN 3.396541e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
409 1.094255e+18 TR4d3_aLa_M RT @RaleighReporter: Probably isn't surprising... 2019-02-09 15:20:11 NaN en NaN 2.191550e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone 1804??_??_??_ 1804??_??_??_ Bod??, Nordland, Norge 67.309478 13.915442 67 18m 34.1219s N, 13 54m 55.5912s E False
410 1.093856e+18 mYuSerNAmePolly RT @RaleighReporter: Probably isn't surprising... 2019-02-08 12:56:04 NaN en NaN 9.644784e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
411 1.093867e+18 starbucksgirl51 RT @RaleighReporter: Probably isn't surprising... 2019-02-08 13:40:25 NaN en NaN 1.846359e+08 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2019-02 2.0 Twitter Web Client NaN NaN NaN NaN NaN NaN NaN
412 1.094003e+18 KG_x24 RT @RaleighReporter: Probably isn't surprising... 2019-02-08 22:38:36 NaN en NaN 3.568611e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
413 1.094037e+18 sydneyjoao_ RT @RaleighReporter: Probably isn't surprising... 2019-02-09 00:57:18 NaN en NaN 2.147814e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
414 1.094093e+18 JuiceGod23_ RT @RaleighReporter: Probably isn't surprising... 2019-02-09 04:38:15 NaN en NaN 1.162785e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Louisiana, USA Louisiana, USA Louisiana, USA 30.870388 -92.007126 30 52m 13.3972s N, 92 0m 25.6536s W True
415 1.094209e+18 wavyynavy RT @RaleighReporter: Probably isn't surprising... 2019-02-09 12:16:53 NaN en NaN 8.059785e+17 NaN <a href="https://mobile.twitter.com" rel="nofo... ... 2019-02 2.0 Twitter Web App NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49806 1.031719e+18 MTBinDurham RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:44:10 NaN en NaN 2.938242e+08 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2018-08 8.0 Twitter Web Client Richmond, VA Richmond, VA Richmond, Richmond City, Virginia, 23298, USA 37.538509 -77.434280 37 32m 18.6313s N, 77 26m 3.408s W NaN
49813 1.031719e+18 tony82576 RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:45:32 NaN en NaN 2.173281e+09 NaN <a href="http://twitter.com/download/android" ... ... 2018-08 8.0 Twitter for Android NaN NaN NaN NaN NaN NaN NaN
49814 1.031719e+18 ExecutiveOtaku RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:45:04 NaN en NaN 6.078021e+07 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2018-08 8.0 Twitter Web Client USA USA USA 39.783730 -100.445882 39 47m 1.42944s N, 100 26m 45.177s W NaN
49818 1.031719e+18 thepattymatos RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:44:31 NaN en NaN 1.634988e+07 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone hire me hire me Scooter Hire, Calle 41, Colonia Centro, Vallad... 20.689681 -88.200257 20 41m 22.8498s N, 88 12m 0.92556s W NaN
49822 1.031719e+18 mrgnvckrs RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:45:22 NaN en NaN 1.378743e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone Denver, CO Denver, CO Denver, Denver County, Colorado, USA 39.739236 -104.984862 39 44m 21.251s N, 104 59m 5.50428s W NaN
49823 1.031719e+18 LisaMichaels1 RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:44:46 NaN en NaN 3.689977e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
49827 1.031719e+18 KyleyUnderhill RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:45:36 NaN en NaN 3.632784e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
49834 1.031718e+18 WinstonSalemDSA RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:43:50 NaN en NaN 9.760895e+17 NaN <a href="http://twitter.com/#!/download/ipad" ... ... 2018-08 8.0 Twitter for iPad NaN NaN NaN NaN NaN NaN NaN
49838 1.031718e+18 _sierraalyse RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:43:24 NaN en NaN 2.227594e+09 NaN <a href="http://twitter.com/download/android" ... ... 2018-08 8.0 Twitter for Android NaN NaN NaN NaN NaN NaN NaN
49852 1.031718e+18 AlyssaAnnBowen RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:41:55 NaN en NaN 8.062270e+17 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
49856 1.031718e+18 Max_Neill RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:42:52 NaN en NaN 3.911495e+08 NaN <a href="http://twitter.com/download/android" ... ... 2018-08 8.0 Twitter for Android NaN NaN NaN NaN NaN NaN NaN
49857 1.031718e+18 p0undcake_ RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:40:54 NaN en NaN 2.207834e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone United States USA USA 39.783730 -100.445882 39 47m 1.42944s N, 100 26m 45.177s W NaN
49865 1.031718e+18 zach_goins RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:43:01 NaN en NaN 3.677972e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone Chapel Hill, NC Chapel Hill, NC Chapel Hill, Orange County, North Carolina, USA 35.913154 -79.055780 35 54m 47.3551s N, 79 3m 20.808s W NaN
49868 1.031718e+18 rodbustamante_ RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:42:52 NaN en NaN 3.327441e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone Chapel Hill, NC Chapel Hill, NC Chapel Hill, Orange County, North Carolina, USA 35.913154 -79.055780 35 54m 47.3551s N, 79 3m 20.808s W NaN
49874 1.031718e+18 _Cebron RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:42:23 NaN en NaN 6.365455e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone CLT CLT Charlotte-Douglas International Airport, Expre... 35.210741 -80.946021 35 12m 38.6692s N, 80 56m 45.6761s W NaN
49892 1.031718e+18 Nike_Bass95 RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:43:12 NaN en NaN 3.387485e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone Beyond the Ark Beyond the Ark Bed Bath & Beyond, Crest Lane, Pinnacle Hills ... 36.308840 -94.176802 36 18m 31.824s N, 94 10m 36.4861s W NaN
49894 1.031718e+18 writerafjanae RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:42:54 NaN en NaN 2.210950e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
49912 1.031718e+18 c_nguyen98 RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:41:02 NaN en NaN 2.511546e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone Chapel Hill, NC Chapel Hill, NC Chapel Hill, Orange County, North Carolina, USA 35.913154 -79.055780 35 54m 47.3551s N, 79 3m 20.808s W NaN
49936 1.031717e+18 jekahben RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:38:11 NaN en NaN 7.677085e+17 NaN <a href="http://twitter.com/download/android" ... ... 2018-08 8.0 Twitter for Android Richmond, VA Richmond, VA Richmond, Richmond City, Virginia, 23298, USA 37.538509 -77.434280 37 32m 18.6313s N, 77 26m 3.408s W NaN
49940 1.031717e+18 Micchisaurus RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:38:19 NaN en NaN 2.324935e+09 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2018-08 8.0 Twitter Web Client NaN NaN NaN NaN NaN NaN NaN
49942 1.031717e+18 JamieBollWBTV RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:38:52 NaN en NaN 3.274613e+08 NaN <a href="https://about.twitter.com/products/tw... ... 2018-08 8.0 TweetDeck Charlotte, NC Charlotte, NC Charlotte, Mecklenburg County, North Carolina,... 35.227087 -80.843127 35 13m 37.5128s N, 80 50m 35.2565s W NaN
49970 1.031717e+18 HarleighDog RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:37:33 NaN en NaN 4.475245e+07 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2018-08 8.0 Twitter Web Client Durham, NC Durham, NC Durham County, North Carolina, USA 36.018132 -78.875158 36 1m 5.27376s N, 78 52m 30.5695s W NaN
49978 1.031717e+18 KateDahls RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:38:59 NaN en NaN 2.098773e+07 NaN <a href="http://twitter.com/#!/download/ipad" ... ... 2018-08 8.0 Twitter for iPad Athens, GA Athens, GA Athens, Athens-Clarke County, Georgia, 3033414... 33.959768 -83.376398 33 57m 35.1637s N, 83 22m 35.0328s W NaN
49986 1.031718e+18 britty_bap RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:40:09 NaN en NaN 1.069193e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
50023 1.031716e+18 DEdwardBeck RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:35:42 NaN en NaN 2.978284e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone Brooklyn Brooklyn BK, Kings County, NYC, New York, 11226, USA 40.650104 -73.949582 40 39m 0.37368s N, 73 56m 58.4963s W NaN
50038 1.031716e+18 Brady_Creef RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:34:15 NaN en NaN 1.416237e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone 252 252 NU, Canada 56.572731 -79.562596 56 34m 21.8316s N, 79 33m 45.3439s W NaN
50072 1.031715e+18 jaylapa_ RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:31:51 NaN en NaN 2.277226e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone nc North Carolina North Carolina, USA 35.672964 -79.039292 35 40m 22.67s N, 79 2m 21.4508s W NaN
50161 1.031715e+18 AnyaLogan RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:29:50 NaN en NaN 2.653554e+07 NaN <a href="http://twitter.com/download/android" ... ... 2018-08 8.0 Twitter for Android NC North Carolina North Carolina, USA 35.672964 -79.039292 35 40m 22.67s N, 79 2m 21.4508s W NaN
50174 1.031715e+18 sirtou2 RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:29:19 NaN en NaN 7.181703e+17 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
50197 1.031715e+18 RyMiko So long #SilentSam https://t.co/lBOkIprCxd 2018-08-21 02:28:26 NaN en NaN 4.183678e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN

4599 rows × 41 columns
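
The two steps above (build a Boolean Series, then use it to select rows) can be sketched on a made-up DataFrame named toy (hypothetical values). Because True counts as 1, summing the mask is a cheap way to see how many rows will be returned before selecting them.

```python
import pandas as pd

# Hypothetical toy data standing in for `db`.
toy = pd.DataFrame(
    {"from_user": ["ann", "bob", "cat", "dan"],
     "retweet_count": [5, 1200, 30, 4000]}
)

mask = toy.retweet_count > 1000  # Boolean Series, one value per row
n_matches = int(mask.sum())      # True counts as 1, so this is the number of matching rows
popular = toy.loc[mask, :]       # keep only the rows where the mask is True
shorthand = toy[mask]            # plain brackets with a mask select the same rows
```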

In [159]:
db.loc[lambda x: x.retweet_count>1000]
Out[159]:
id_str from_user text time geo_coordinates user_lang in_reply_to_screen_name from_user_id_str in_reply_to_status_id_str source ... yearmon month trans_sour pre no addr lat long point_x Accurarcy
177 1.099048e+18 AkeemElmin RT @RaleighReporter: Probably isn't surprising... 2019-02-22 20:45:39 NaN en NaN 1.598616e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Eindhoven, The Netherlands Eindhoven, The Netherlands Eindhoven, Noord-Brabant, Nederland 51.439265 5.478633 51 26m 21.3533s N, 5 28m 43.0788s E True
249 1.097118e+18 28Haitiankids RT @RaleighReporter: Probably isn't surprising... 2019-02-17 12:57:02 NaN en NaN 1.952861e+09 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android Casselberry, Fl Casselberry, Fl Casselberry, Seminole County, Florida, 32707, USA 28.654276 -81.323791 28 39m 15.3936s N, 81 19m 25.6467s W True
250 1.097220e+18 chasejefferson_ RT @RaleighReporter: Probably isn't surprising... 2019-02-17 19:44:29 NaN en NaN 3.352377e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Memphis | DC Memphis | DC Compa???_a California s.a. - MEMPHIS, Calle 17... 4.619598 -74.093892 4 37m 10.5535s N, 74 5m 38.0123s W False
309 1.095656e+18 Prof_Kennedy RT @RaleighReporter: Probably isn't surprising... 2019-02-13 12:10:27 NaN en NaN 2.479750e+08 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2019-02 2.0 Twitter Web Client Rhode Island Rhode Island Rhode Island, USA 41.796241 -71.599237 41 47m 46.4672s N, 71 35m 57.2539s W True
340 1.095102e+18 Misskatengo RT @RaleighReporter: Probably isn't surprising... 2019-02-11 23:28:04 NaN en NaN 5.232052e+08 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2019-02 2.0 Twitter Web Client My lane My lane My Lane, Falls Township, Bucks County, Pennsyl... 40.193663 -74.797663 40 11m 37.1868s N, 74 47m 51.5868s W False
345 1.095146e+18 KellyRek RT @RaleighReporter: Probably isn't surprising... 2019-02-12 02:24:03 NaN en NaN 3.095588e+08 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android Arizona Arizona Arizona, USA 34.395342 -111.763276 34 23m 43.2312s N, 111 45m 47.7918s W True
351 1.094351e+18 antoniodivine RT @RaleighReporter: Probably isn't surprising... 2019-02-09 21:41:35 NaN en NaN 3.185836e+07 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android Miami, FL, USA Miami, FL, USA Miami, Miami-Dade County, Florida, USA 25.774266 -80.193659 25 46m 27.3569s N, 80 11m 37.172s W True
353 1.094788e+18 _Claytron RT @RaleighReporter: Probably isn't surprising... 2019-02-11 02:39:35 NaN en NaN 3.160624e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Greater Israel Greater Israel Israel, Tettedze Street, Christian Village, Ga... 5.639444 -0.232406 5 38m 21.9971s N, 0 13m 56.6613s W False
357 1.095157e+18 wrongestwrong RT @RaleighReporter: Probably isn't surprising... 2019-02-12 03:04:36 NaN en NaN 9.167096e+07 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
360 1.094821e+18 trapp_21 RT @RaleighReporter: Probably isn't surprising... 2019-02-11 04:48:57 NaN en NaN 4.197632e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Charlotte, NC Charlotte, NC Charlotte, Mecklenburg County, North Carolina,... 35.227087 -80.843127 35 13m 37.5128s N, 80 50m 35.2565s W True
361 1.095104e+18 Rashidbelike RT @RaleighReporter: Probably isn't surprising... 2019-02-11 23:35:40 NaN en NaN 7.078800e+17 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
368 1.095148e+18 ZaRdOz420WPN RT @RaleighReporter: Probably isn't surprising... 2019-02-12 02:31:11 NaN en NaN 7.281802e+08 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android Appalachia Appalachia Appalachia, Wise County, Virginia, 24216, USA 36.906763 -82.781828 36 54m 24.3472s N, 82 46m 54.579s W True
378 1.095437e+18 SYMONERAIDER RT @RaleighReporter: Probably isn't surprising... 2019-02-12 21:39:22 NaN en NaN 8.816903e+17 NaN <a href="http://twitter.com/#!/download/ipad" ... ... 2019-02 2.0 Twitter for iPad United States USA USA 39.783730 -100.445882 39 47m 1.42944s N, 100 26m 45.177s W True
380 1.094347e+18 knight_dlx RT @RaleighReporter: Probably isn't surprising... 2019-02-09 21:25:41 NaN en NaN 8.292396e+07 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android online online Online, Croix d'Argent, Montpellier, H??rault,... 43.590472 3.859513 43 35m 25.6987s N, 3 51m 34.2476s E False
381 1.094306e+18 mfdaan RT @RaleighReporter: Probably isn't surprising... 2019-02-09 18:42:26 NaN en NaN 3.511230e+07 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
393 1.094487e+18 deuceohsixx RT @RaleighReporter: Probably isn't surprising... 2019-02-10 06:42:34 NaN en NaN 3.902784e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Belltown, Seattle, Washington Belltown, Seattle, Washington Belltown, Seattle, King County, Washington, 98... 47.613231 -122.345361 47 36m 47.632s N, 122 20m 43.2985s W True
395 1.094784e+18 WayneLev1 RT @RaleighReporter: Probably isn't surprising... 2019-02-11 02:22:56 NaN en NaN 7.259834e+08 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android South Florida South Florida South Florida, Columbus, Cherokee County, Kans... 37.167713 -94.846426 37 10m 3.76608s N, 94 50m 47.1322s W False
402 1.093862e+18 21ponky RT @RaleighReporter: Probably isn't surprising... 2019-02-08 13:21:36 NaN en NaN 4.621058e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone miami miami Miami, Miami-Dade County, Florida, USA 25.774266 -80.193659 25 46m 27.3569s N, 80 11m 37.172s W True
403 1.093865e+18 emulatelife RT @RaleighReporter: Probably isn't surprising... 2019-02-08 13:29:56 NaN en NaN 2.615801e+08 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2019-02 2.0 Twitter Web Client Atlanta Atlanta Atlanta, Fulton County, Georgia, USA 33.749099 -84.390185 33 44m 56.7553s N, 84 23m 24.6656s W True
404 1.093974e+18 mighty_bee_ RT @RaleighReporter: Probably isn't surprising... 2019-02-08 20:45:41 NaN en NaN 3.684749e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone heaven only knows heaven only knows Heaven Only Knows Fashioins, 2630, Bourquin Cr... 49.049827 -122.309137 49 2m 59.3765s N, 122 18m 32.8939s W False
405 1.093922e+18 yarieliz03 RT @RaleighReporter: Probably isn't surprising... 2019-02-08 17:17:01 NaN en NaN 1.393197e+09 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android NaN NaN NaN NaN NaN NaN NaN
407 1.093928e+18 BarcoLemuel RT @RaleighReporter: Probably isn't surprising... 2019-02-08 17:43:55 NaN en NaN 1.040304e+18 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Preston Hollow, Dallas Preston Hollow, Dallas Preston Hollow United Methodist, 6315, Walnut ... 32.880687 -96.797175 32 52m 50.473s N, 96 47m 49.8306s W True
408 1.093966e+18 SimphiweFigo RT @RaleighReporter: Probably isn't surprising... 2019-02-08 20:12:33 NaN en NaN 3.396541e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
409 1.094255e+18 TR4d3_aLa_M RT @RaleighReporter: Probably isn't surprising... 2019-02-09 15:20:11 NaN en NaN 2.191550e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone 1804??_??_??_ 1804??_??_??_ Bod??, Nordland, Norge 67.309478 13.915442 67 18m 34.1219s N, 13 54m 55.5912s E False
410 1.093856e+18 mYuSerNAmePolly RT @RaleighReporter: Probably isn't surprising... 2019-02-08 12:56:04 NaN en NaN 9.644784e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
411 1.093867e+18 starbucksgirl51 RT @RaleighReporter: Probably isn't surprising... 2019-02-08 13:40:25 NaN en NaN 1.846359e+08 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2019-02 2.0 Twitter Web Client NaN NaN NaN NaN NaN NaN NaN
412 1.094003e+18 KG_x24 RT @RaleighReporter: Probably isn't surprising... 2019-02-08 22:38:36 NaN en NaN 3.568611e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
413 1.094037e+18 sydneyjoao_ RT @RaleighReporter: Probably isn't surprising... 2019-02-09 00:57:18 NaN en NaN 2.147814e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
414 1.094093e+18 JuiceGod23_ RT @RaleighReporter: Probably isn't surprising... 2019-02-09 04:38:15 NaN en NaN 1.162785e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone Louisiana, USA Louisiana, USA Louisiana, USA 30.870388 -92.007126 30 52m 13.3972s N, 92 0m 25.6536s W True
415 1.094209e+18 wavyynavy RT @RaleighReporter: Probably isn't surprising... 2019-02-09 12:16:53 NaN en NaN 8.059785e+17 NaN <a href="https://mobile.twitter.com" rel="nofo... ... 2019-02 2.0 Twitter Web App NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49806 1.031719e+18 MTBinDurham RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:44:10 NaN en NaN 2.938242e+08 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2018-08 8.0 Twitter Web Client Richmond, VA Richmond, VA Richmond, Richmond City, Virginia, 23298, USA 37.538509 -77.434280 37 32m 18.6313s N, 77 26m 3.408s W NaN
49813 1.031719e+18 tony82576 RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:45:32 NaN en NaN 2.173281e+09 NaN <a href="http://twitter.com/download/android" ... ... 2018-08 8.0 Twitter for Android NaN NaN NaN NaN NaN NaN NaN
49814 1.031719e+18 ExecutiveOtaku RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:45:04 NaN en NaN 6.078021e+07 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2018-08 8.0 Twitter Web Client USA USA USA 39.783730 -100.445882 39 47m 1.42944s N, 100 26m 45.177s W NaN
49818 1.031719e+18 thepattymatos RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:44:31 NaN en NaN 1.634988e+07 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone hire me hire me Scooter Hire, Calle 41, Colonia Centro, Vallad... 20.689681 -88.200257 20 41m 22.8498s N, 88 12m 0.92556s W NaN
49822 1.031719e+18 mrgnvckrs RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:45:22 NaN en NaN 1.378743e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone Denver, CO Denver, CO Denver, Denver County, Colorado, USA 39.739236 -104.984862 39 44m 21.251s N, 104 59m 5.50428s W NaN
49823 1.031719e+18 LisaMichaels1 RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:44:46 NaN en NaN 3.689977e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
49827 1.031719e+18 KyleyUnderhill RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:45:36 NaN en NaN 3.632784e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
49834 1.031718e+18 WinstonSalemDSA RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:43:50 NaN en NaN 9.760895e+17 NaN <a href="http://twitter.com/#!/download/ipad" ... ... 2018-08 8.0 Twitter for iPad NaN NaN NaN NaN NaN NaN NaN
49838 1.031718e+18 _sierraalyse RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:43:24 NaN en NaN 2.227594e+09 NaN <a href="http://twitter.com/download/android" ... ... 2018-08 8.0 Twitter for Android NaN NaN NaN NaN NaN NaN NaN
49852 1.031718e+18 AlyssaAnnBowen RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:41:55 NaN en NaN 8.062270e+17 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
49856 1.031718e+18 Max_Neill RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:42:52 NaN en NaN 3.911495e+08 NaN <a href="http://twitter.com/download/android" ... ... 2018-08 8.0 Twitter for Android NaN NaN NaN NaN NaN NaN NaN
49857 1.031718e+18 p0undcake_ RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:40:54 NaN en NaN 2.207834e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone United States USA USA 39.783730 -100.445882 39 47m 1.42944s N, 100 26m 45.177s W NaN
49865 1.031718e+18 zach_goins RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:43:01 NaN en NaN 3.677972e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone Chapel Hill, NC Chapel Hill, NC Chapel Hill, Orange County, North Carolina, USA 35.913154 -79.055780 35 54m 47.3551s N, 79 3m 20.808s W NaN
49868 1.031718e+18 rodbustamante_ RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:42:52 NaN en NaN 3.327441e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone Chapel Hill, NC Chapel Hill, NC Chapel Hill, Orange County, North Carolina, USA 35.913154 -79.055780 35 54m 47.3551s N, 79 3m 20.808s W NaN
49874 1.031718e+18 _Cebron RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:42:23 NaN en NaN 6.365455e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone CLT CLT Charlotte-Douglas International Airport, Expre... 35.210741 -80.946021 35 12m 38.6692s N, 80 56m 45.6761s W NaN
49892 1.031718e+18 Nike_Bass95 RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:43:12 NaN en NaN 3.387485e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone Beyond the Ark Beyond the Ark Bed Bath & Beyond, Crest Lane, Pinnacle Hills ... 36.308840 -94.176802 36 18m 31.824s N, 94 10m 36.4861s W NaN
49894 1.031718e+18 writerafjanae RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:42:54 NaN en NaN 2.210950e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
49912 1.031718e+18 c_nguyen98 RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:41:02 NaN en NaN 2.511546e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone Chapel Hill, NC Chapel Hill, NC Chapel Hill, Orange County, North Carolina, USA 35.913154 -79.055780 35 54m 47.3551s N, 79 3m 20.808s W NaN
49936 1.031717e+18 jekahben RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:38:11 NaN en NaN 7.677085e+17 NaN <a href="http://twitter.com/download/android" ... ... 2018-08 8.0 Twitter for Android Richmond, VA Richmond, VA Richmond, Richmond City, Virginia, 23298, USA 37.538509 -77.434280 37 32m 18.6313s N, 77 26m 3.408s W NaN
49940 1.031717e+18 Micchisaurus RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:38:19 NaN en NaN 2.324935e+09 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2018-08 8.0 Twitter Web Client NaN NaN NaN NaN NaN NaN NaN
49942 1.031717e+18 JamieBollWBTV RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:38:52 NaN en NaN 3.274613e+08 NaN <a href="https://about.twitter.com/products/tw... ... 2018-08 8.0 TweetDeck Charlotte, NC Charlotte, NC Charlotte, Mecklenburg County, North Carolina,... 35.227087 -80.843127 35 13m 37.5128s N, 80 50m 35.2565s W NaN
49970 1.031717e+18 HarleighDog RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:37:33 NaN en NaN 4.475245e+07 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2018-08 8.0 Twitter Web Client Durham, NC Durham, NC Durham County, North Carolina, USA 36.018132 -78.875158 36 1m 5.27376s N, 78 52m 30.5695s W NaN
49978 1.031717e+18 KateDahls RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:38:59 NaN en NaN 2.098773e+07 NaN <a href="http://twitter.com/#!/download/ipad" ... ... 2018-08 8.0 Twitter for iPad Athens, GA Athens, GA Athens, Athens-Clarke County, Georgia, 3033414... 33.959768 -83.376398 33 57m 35.1637s N, 83 22m 35.0328s W NaN
49986 1.031718e+18 britty_bap RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:40:09 NaN en NaN 1.069193e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
50023 1.031716e+18 DEdwardBeck RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:35:42 NaN en NaN 2.978284e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone Brooklyn Brooklyn BK, Kings County, NYC, New York, 11226, USA 40.650104 -73.949582 40 39m 0.37368s N, 73 56m 58.4963s W NaN
50038 1.031716e+18 Brady_Creef RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:34:15 NaN en NaN 1.416237e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone 252 252 NU, Canada 56.572731 -79.562596 56 34m 21.8316s N, 79 33m 45.3439s W NaN
50072 1.031715e+18 jaylapa_ RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:31:51 NaN en NaN 2.277226e+09 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone nc North Carolina North Carolina, USA 35.672964 -79.039292 35 40m 22.67s N, 79 2m 21.4508s W NaN
50161 1.031715e+18 AnyaLogan RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:29:50 NaN en NaN 2.653554e+07 NaN <a href="http://twitter.com/download/android" ... ... 2018-08 8.0 Twitter for Android NC North Carolina North Carolina, USA 35.672964 -79.039292 35 40m 22.67s N, 79 2m 21.4508s W NaN
50174 1.031715e+18 sirtou2 RT @RyMiko: So long #SilentSam https://t.co/lB... 2018-08-21 02:29:19 NaN en NaN 7.181703e+17 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
50197 1.031715e+18 RyMiko So long #SilentSam https://t.co/lBOkIprCxd 2018-08-21 02:28:26 NaN en NaN 4.183678e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2018-08 8.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN

4599 rows × 41 columns

In [165]:
db.loc[(db.retweet_count>1000) & (db.favorite_count>10),['from_user',"retweet_count","favorite_count"]]
Out[165]:
from_user retweet_count favorite_count
1700 RaleighReporter 1608.0 3193.0
40274 DineshDSouza 2566.0 4953.0
50197 RyMiko 1183.0 3976.0

Operations on DataFrame

In [240]:
db.head(3)
Out[240]:
id_str from_user text time geo_coordinates user_lang in_reply_to_screen_name from_user_id_str in_reply_to_status_id_str source ... yearmon month trans_sour pre no addr lat long point_x Accurarcy
0 1.099692e+18 miriammarkfield RT @jordangreentcb: I filed my stories about t... 2019-02-24 15:24:51 NaN en NaN 1.022726e+08 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2019-02 2.0 Twitter Web Client North Carolina North Carolina North Carolina, USA 35.672964 -79.039292 35 40m 22.67s N, 79 2m 21.4508s W True
1 1.099563e+18 1st_Reduce_Harm RT @jordangreentcb: #silentsam https://t.co/55... 2019-02-24 06:53:50 NaN en NaN 9.645217e+17 NaN <a href="http://twitter.com/download/android" ... ... 2019-02 2.0 Twitter for Android Not "the Midwest", THE NORTH. Not "the Midwest", THE NORTH. Midwest, Natrona County, Wyoming, USA 43.411391 -106.280075 43 24m 41.0076s N, 106 16m 48.27s W False
2 1.099726e+18 SilentSamIAm RT @jordangreentcb: Antiracists tell neo-Confe... 2019-02-24 17:42:29 NaN en NaN 9.137753e+17 NaN <a href="http://twitter.com/download/iphone" r... ... 2019-02 2.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN

3 rows × 41 columns

In [241]:
db.tail(3)
Out[241]:
id_str from_user text time geo_coordinates user_lang in_reply_to_screen_name from_user_id_str in_reply_to_status_id_str source ... yearmon month trans_sour pre no addr lat long point_x Accurarcy
59251 7.254739e+17 LokoVybe Bout to have #shot of some #SilentSam and munc... 2016-04-28 00:57:10 NaN en NaN 5.240004e+08 NaN <a href="http://www.twitter.com" rel="nofollow... ... 2016-04 4.0 Twitter for BlackBerry NaN NaN NaN NaN NaN NaN NaN
59252 7.087297e+17 haley_nm I apologize to anyone that I was talking to la... 2016-03-12 19:01:45 NaN en NaN 3.278226e+08 NaN <a href="http://twitter.com/download/iphone" r... ... 2016-03 3.0 Twitter for iPhone NaN NaN NaN NaN NaN NaN NaN
59253 7.065352e+17 LocoCravey RT @jdmar3: A monument to unknown dead people ... 2016-03-06 17:41:31 NaN en NaN 1.119392e+09 NaN <a href="http://twitter.com" rel="nofollow">Tw... ... 2016-03 3.0 Twitter Web Client Carrboro, North Carolina Carrboro, North Carolina Carrboro, Orange County, North Carolina, 27510... 35.910144 -79.075289 35 54m 36.5177s N, 79 4m 31.0422s W NaN

3 rows × 41 columns

In [170]:
db.iloc[:5].retweet_count
Out[170]:
0    13.0
1     3.0
2    28.0
3    25.0
4    28.0
Name: retweet_count, dtype: float64
In [171]:
db.iloc[:5].retweet_count + 5
Out[171]:
0    18.0
1     8.0
2    33.0
3    30.0
4    33.0
Name: retweet_count, dtype: float64
In [173]:
db.iloc[:5].retweet_count.count()
Out[173]:
5
In [174]:
db.iloc[:5].retweet_count.min()
Out[174]:
3.0
In [175]:
db.iloc[:5].retweet_count.max()
Out[175]:
28.0
In [176]:
db.iloc[:5].retweet_count.sum()
Out[176]:
97.0
In [180]:
db.iloc[:5].retweet_count.idxmax()
Out[180]:
'2'
In [185]:
user.iloc[:5].tweets_num / user.iloc[:5].created_days
Out[185]:
0    2.281867
1    0.357654
2    0.897810
3    0.086986
4    0.068835
dtype: float64

The apply and map functions are also very useful when you want to run the same function on every element of a Series or on every row of a DataFrame.

In [191]:
def messify(x):
    return (x**2)+x-5

user.iloc[:5].tweets_num.apply(messify)
Out[191]:
0    1616707.0
1     639195.0
2     545377.0
3      67855.0
4      64765.0
Name: tweets_num, dtype: float64
In [194]:
user.iloc[:5].tweets_num.map(messify)
Out[194]:
0    1616707.0
1     639195.0
2     545377.0
3      67855.0
4      64765.0
Name: tweets_num, dtype: float64
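
To make the difference between the two a bit more concrete, here is a minimal sketch on a made-up two-row DataFrame (the `toy` frame and its values are assumptions for illustration, not the workshop data). On a Series, `map` and `apply` both call the function once per element; on a DataFrame, `apply` with `axis=1` hands each row to the function.

```python
import pandas as pd


def messify(x):
    return (x ** 2) + x - 5


# Hypothetical toy data, not the workshop's `user` table.
toy = pd.DataFrame({"tweets_num": [10.0, 20.0], "created_days": [5.0, 40.0]})

# On a Series, map and apply both call the function element by element.
mapped = toy["tweets_num"].map(messify)
applied = toy["tweets_num"].apply(messify)

# On a DataFrame, apply with axis=1 passes each row (as a Series) to the function.
per_day = toy.apply(lambda row: row["tweets_num"] / row["created_days"], axis=1)

print(mapped)
print(per_day)
```

For simple arithmetic like this, the vectorized form used earlier (`column_a / column_b`) is usually faster; `apply` is most useful when the logic cannot be written as column operations.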

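The trickiest part is the axis and level arguments: axis chooses whether you aggregate down the rows (axis=0) or across the columns (axis=1), and level chooses which level of a hierarchical index to aggregate by.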

In [225]:
d=pd.DataFrame([['inls101','f12',12,3,2,2],['inls101','f12',12,3],['inls103','f13',12,3,3,6]])
d.columns=['Course','Sem','x','y','z','d']
d=d.set_index(['Course','Sem'])
d
Out[225]:
x y z d
Course Sem
inls101 f12 12 3 2.0 2.0
f12 12 3 NaN NaN
inls103 f13 12 3 3.0 6.0
In [236]:
#default axis = 0
d.sum(axis=0)
Out[236]:
x    36.0
y     9.0
z     5.0
d     8.0
dtype: float64
In [232]:
d.sum(axis=0, level=0)
Out[232]:
x y z d
Course
inls101 24 6 2.0 2.0
inls103 12 3 3.0 6.0
In [233]:
d.sum(axis=0, level=1)
Out[233]:
x y z d
Sem
f12 24 6 2.0 2.0
f13 12 3 3.0 6.0
In [228]:
d.sum(axis=1)
Out[228]:
Course   Sem
inls101  f12    19.0
         f12    15.0
inls103  f13    24.0
dtype: float64
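
One caveat: the `level` argument of `sum` was deprecated in pandas 1.3 and removed in pandas 2.0, so the `d.sum(axis=0, level=...)` calls above may fail on a recent installation. A sketch of the equivalent using `groupby(level=...)`, rebuilding the same small table (here the short row is padded explicitly with NaN):

```python
import pandas as pd

nan = float("nan")

# Same toy table as above, with the missing cells written out as NaN.
d = pd.DataFrame(
    [
        ["inls101", "f12", 12, 3, 2, 2],
        ["inls101", "f12", 12, 3, nan, nan],
        ["inls103", "f13", 12, 3, 3, 6],
    ],
    columns=["Course", "Sem", "x", "y", "z", "d"],
).set_index(["Course", "Sem"])

# Equivalent of d.sum(axis=0, level=0): aggregate rows that share a Course.
by_course = d.groupby(level="Course").sum()

# Equivalent of d.sum(axis=0, level=1): aggregate rows that share a Sem.
by_sem = d.groupby(level="Sem").sum()

# axis=1 is unchanged: sum across the columns of each row, skipping NaN.
row_totals = d.sum(axis=1)

print(by_course)
print(by_sem)
print(row_totals)
```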

Missing Values

None is a Python singleton object that is often used for missing data in Python code. Because it is a Python object, None cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects).
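
A small sketch of what that means in practice (the arrays here are made up for illustration): a NumPy array containing None falls back to the `object` dtype, while NaN keeps a floating-point dtype.

```python
import numpy as np
import pandas as pd

# A None forces NumPy to store generic Python objects.
obj_arr = np.array([1, None, 3, 4])

# NaN is a float, so the array stays numeric.
float_arr = np.array([1, np.nan, 3, 4])

# pandas converts None to NaN when the rest of the data is numeric.
s = pd.Series([1, np.nan, 2, None])

print(obj_arr.dtype)    # object
print(float_arr.dtype)  # float64
print(s)
```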

Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:

  • isnull(): Generate a boolean mask indicating missing values

  • notnull(): Opposite of isnull()

  • dropna(): Return a filtered version of the data

  • fillna(): Return a copy of the data with missing values filled or imputed
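
The four methods listed above can be tried on a tiny Series before using them on the real data. This is only a sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, None])

mask = s.isnull()        # True where the value is missing
kept = s[s.notnull()]    # boolean mask keeps the non-missing values
dropped = s.dropna()     # same result, without building the mask yourself
filled = s.fillna(0)     # copy with missing values replaced by 0

print(mask.tolist())
print(dropped.tolist())
print(filled.tolist())
```

None of these calls modifies `s`; each returns a new object.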

isna() is an alias of isnull(); it checks whether each value is missing

In [238]:
user.tweets_num.isna().head(5)
Out[238]:
0    False
1    False
2    False
3    False
4    False
Name: tweets_num, dtype: bool

The opposite is notna()

In [243]:
user.loc[user.geo_coordinates.notna()]
Out[243]:
user_screen_name tweets_num id_str from_user text time geo_coordinates user_lang in_reply_to_screen_name from_user_id_str ... isoutlier_if created_days eng_or_not pre no addr lat long point_x Accurarcy
2344 asvpgrime 3.0 1035324762132762624 asvpgrime 🔥 It’s going down like #silentsam TONIGHT 🔥th... 2018-08-31 01:33:53 loc: 35.9259,-79.0395 en NaN 3.689527e+09 ... 1.0 1268.0 English Chapel Hill, NC Chapel Hill, NC Chapel Hill, Orange County, North Carolina, USA 35.913154 -79.055780 35 54m 47.3551s N, 79 3m 20.808s W NaN
2965 piercefreelon 3.0 1069796106232578051 piercefreelon Looks like Chapel Hill needs a reminder. Repos... 2018-12-04 03:30:42 loc: 35.9069925,-79.0213309 en NaN 1.155899e+08 ... 1.0 3305.0 English Durham, NC Durham, NC Durham County, North Carolina, USA 36.018132 -78.875158 36 1m 5.27376s N, 78 52m 30.5695s W NaN
4921 kwameinc 2.0 1097883279267553281 kwameinc What you do when the world is watching makes h... 2019-02-19 15:39:06 loc: 35.9069925,-79.0213309 en NaN 4.761046e+08 ... 1.0 2597.0 English NaN NaN NaN NaN NaN NaN NaN
11531 ThumbleNiz 1.0 1031893353057714177 ThumbleNiz Full support for UNC protesters toppling #Sile... 2018-08-21 14:18:41 loc: 36.61515889,-83.72075767 en NaN 4.064058e+09 ... 1.0 1227.0 English Los Angeles, CA Los Angeles, CA LA, Los Angeles County, California, USA 34.053683 -118.242767 34 3m 13.2602s N, 118 14m 33.9608s W NaN
13071 firedog729 1.0 1032002796524990465 firedog729 #Dead #SilentSam #JobWellDone\r\r\n\r\r\n#Rp @... 2018-08-21 21:33:35 loc: 34.02734215,-118.27953236 en NaN 7.345886e+07 ... 1.0 3465.0 English Los Angeles, CA Los Angeles, CA LA, Los Angeles County, California, USA 34.053683 -118.242767 34 3m 13.2602s N, 118 14m 33.9608s W NaN
20941 NeuseNews 1.0 1087671425153073154 NeuseNews Mike Parker: 'Historical cleansing' continues ... 2019-01-22 11:20:50 loc: 35.262204,-77.5820908 en NaN 9.835448e+17 ... 1.0 333.0 English Kinston, NC Kinston, NC Kinston, Lenoir County, North Carolina, USA 35.262664 -77.581635 35 15m 45.5886s N, 77 34m 53.8871s W NaN
22591 barrysolaidback 1.0 1033413470140723200 barrysolaidback #SilentSam https://t.co/JdwmOPmYAW 2018-08-25 18:59:05 loc: 35.91387167,-79.0523305 en NaN 9.817781e+08 ... 1.0 2289.0 English Durham, NC Durham, NC Durham County, North Carolina, USA 36.018132 -78.875158 36 1m 5.27376s N, 78 52m 30.5695s W NaN
23444 charline_woods 1.0 1032045862350934016 charline_woods Watching all the UNC kids drink from the fount... 2018-08-22 00:24:42 loc: 35.9259,-79.0395 en NaN 5.349888e+08 ... 1.0 2541.0 English Raleigh, NC Raleigh, NC Raleigh, Wake County, North Carolina, USA 35.780398 -78.639099 35 46m 49.4317s N, 78 38m 20.756s W NaN
23785 lulunyc 1.0 909140231835783169 lulunyc Political discourse. #silentsam @ Silent Sam h... 2017-09-16 20:41:36 loc: 35.914,-79.0524 en NaN 2.204532e+07 ... 1.0 3662.0 English New York, NY New York, NY NYC, New York, USA 40.730862 -73.987156 40 43m 51.1028s N, 73 59m 13.7609s W NaN

9 rows × 41 columns

dropna() drops the rows (or columns) that contain missing values
fillna() fills the missing values with a value you choose

In [262]:
db.iloc[:5,15:17]
Out[262]:
retweet_count favorite_count
0 13.0 NaN
1 3.0 NaN
2 28.0 NaN
3 25.0 NaN
4 28.0 NaN
In [263]:
db.iloc[:5,15:17].fillna(0)
Out[263]:
retweet_count favorite_count
0 13.0 0.0
1 3.0 0.0
2 28.0 0.0
3 25.0 0.0
4 28.0 0.0
In [267]:
db.iloc[:5,15:17].fillna(method="ffill", axis=1)
Out[267]:
retweet_count favorite_count
0 13.0 13.0
1 3.0 3.0
2 28.0 28.0
3 25.0 25.0
4 28.0 28.0
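
Note that the `method=` argument of `fillna` is deprecated in recent pandas releases (2.1 and later). A sketch of the same forward fill using the dedicated `ffill` method, on a made-up two-row frame rather than the workshop data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for db.iloc[:5, 15:17].
toy = pd.DataFrame(
    {"retweet_count": [13.0, 3.0], "favorite_count": [np.nan, np.nan]}
)

# Replace every missing value with a constant.
filled_zero = toy.fillna(0)

# Forward fill along axis=1: each NaN takes the value of the column to its left.
filled_across = toy.ffill(axis=1)

print(filled_zero)
print(filled_across)
```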

Now let me introduce missingno, a tool for visualizing missing values

In [272]:
import missingno
import seaborn as sns
sns.set()

missingno.matrix(user, labels=True)
Out[272]:
<matplotlib.axes._subplots.AxesSubplot at 0x142b53c7eb8>
In [270]:
missingno.bar(user)
Out[270]:
<matplotlib.axes._subplots.AxesSubplot at 0x142b3e11630>

pandas_profiling is another powerful tool: it produces an interactive, JavaScript-based report on your dataset.

For some reason, though, it would not run on my machine...

In [ ]:
import pandas_profiling

pandas_profiling.ProfileReport(db.iloc[:100,:10])

Data Aggregation

Groupby
Groupby is one of the most widely used data aggregation tools in pandas. It splits the data into chunks, applies a function to each chunk, and then combines the results into a single output.

groupby
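
To see the split, apply, and combine steps separately, here is a sketch that does them by hand on a made-up table and compares the result with a single `groupby` call (the column names `lang` and `retweets` are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "lang": ["English", "English", "French"],
        "retweets": [3, 5, 2],
    }
)

# Split: iterating over a groupby yields (key, chunk) pairs.
# Apply: sum the retweets inside each chunk.
pieces = {}
for lang, chunk in toy.groupby("lang"):
    pieces[lang] = chunk["retweets"].sum()

# Combine: collect the per-chunk results into one Series.
manual = pd.Series(pieces)

# The usual one-liner does all three steps at once.
auto = toy.groupby("lang")["retweets"].sum()

print(manual)
print(auto)
```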

In [10]:
db.groupby(["lang_trans"]).id_str.count()
Out[10]:
lang_trans
Arabic                       4
Basque                       1
Catalan; Valencian           1
Croatian                     1
Danish                       6
Dutch; Flemish              22
English                  58585
English UK                 118
Finnish                      3
French                      74
German                      74
Greek, Modern (1453-)        8
Hebrew                       5
Hungarian                    2
Indonesian                   1
Italian                     87
Japanese                    52
Korean                       4
LOLCATZ                      1
Norwegian                    1
Polish                      11
Portuguese                  28
Russian                      2
Spanish; Castilian         113
Swedish                     10
Turkish                      1
Vietnamese                   1
Name: id_str, dtype: int64
In [18]:
db.groupby(["lang_trans"]).agg({"id_str":"count"})
Out[18]:
id_str
lang_trans
Arabic 4
Basque 1
Catalan; Valencian 1
Croatian 1
Danish 6
Dutch; Flemish 22
English 58585
English UK 118
Finnish 3
French 74
German 74
Greek, Modern (1453-) 8
Hebrew 5
Hungarian 2
Indonesian 1
Italian 87
Japanese 52
Korean 4
LOLCATZ 1
Norwegian 1
Polish 11
Portuguese 28
Russian 2
Spanish; Castilian 113
Swedish 10
Turkish 1
Vietnamese 1

There are two kinds of data format.

  • Long format: organize the data in a tree structure (a hierarchical index)
  • Wide format: lay the data out as a flat table, with one column per category

longwide
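
A minimal sketch of moving between the two layouts with `unstack`, using a made-up three-row table (the names `lang`, `source`, and `id` are placeholders, not the workshop columns):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "lang": ["English", "English", "French"],
        "source": ["iPhone", "Android", "iPhone"],
        "id": [1, 2, 3],
    }
)

# Long layout: one row per (lang, source) pair, stored in a hierarchical index.
long_counts = toy.groupby(["lang", "source"])["id"].count()

# Wide layout: the inner index level becomes the columns.
# Combinations that never occur show up as NaN.
wide_counts = long_counts.unstack()

# fill_value replaces those gaps while reshaping.
wide_filled = long_counts.unstack(fill_value=0)

print(long_counts)
print(wide_counts)
print(wide_filled)
```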

In [21]:
db.groupby(["lang_trans",'trans_sour']).count().iloc[:,:2]
Out[21]:
id_str from_user
lang_trans trans_sour
Arabic Twitter for Android 1 1
Twitter for iPhone 3 3
Basque Twitter for Android 1 1
Catalan; Valencian Twitter Web Client 1 1
Croatian Twitter for Android 1 1
Danish Twitter Web Client 1 1
Twitter for Android 3 3
Twitter for iPhone 2 2
Dutch; Flemish Twitter Web App 2 2
Twitter Web Client 1 1
Twitter for Android 17 17
Twitter for iPhone 2 2
English rohingya Update 1 1
#BREAKING 2 2
5b2b04cc8c451264154992 1 1
5b37d2a786aeb453738826 1 1
5b40d42167e88445731584 1 1
Academic Chatter 2 2
AntifaTron 2.0 9 9
Aplos for Twitter 1 1
BLOX CMS 3 3
Bitly 1 1
Bizzybot 1 1
Blog2Social APP 3 3
Buffer 151 151
CAAACTIONORG 1 1
Cal Fact Check 1 1
Choqok 2 2
Cloudhopper 2 2
CoSchedule 2 2
... ... ... ...
Italian Twitter for iPhone 9 9
Japanese Twitter Web Client 32 32
Twitter for Android 3 3
Twitter for iPhone 17 17
Korean Twitter for Android 1 1
Twitter for iPhone 3 3
LOLCATZ Twitter for Android 1 1
Norwegian Twitter for iPhone 1 1
Polish Twitter Web Client 5 5
Twitter for Android 5 5
Twitter for iPhone 1 1
Portuguese Twitter Web App 2 2
Twitter Web Client 11 11
Twitter for Android 10 10
Twitter for iPad 1 1
Twitter for iPhone 4 4
Russian Twitter Web App 1 1
Twitter for Android 1 1
Spanish; Castilian Mobile Web (M2) 2 2
TweetDeck 2 2
Twitter Web App 5 5
Twitter Web Client 22 22
Twitter for Android 56 56
Twitter for iPhone 26 26
Swedish Twitter Web Client 4 4
Twitter for Android 3 3
Twitter for iPad 1 1
Twitter for iPhone 2 2
Turkish Twitter for Android 1 1
Vietnamese Twitter Web Client 1 1

217 rows × 2 columns

In [32]:
db.groupby(["lang_trans",'trans_sour']).count().iloc[:,1].unstack()
Out[32]:
trans_sour rohingya Update #BREAKING 5b2b04cc8c451264154992 5b37d2a786aeb453738826 5b40d42167e88445731584 Academic Chatter AntifaTron 2.0 Aplos for Twitter BLOX CMS Bitly ... otm_wptt phos-tweet1 preciousproverbs pxbern reweets test djt ret todaysmathematics twicca weirdrobots whotrendedit Оwly
lang_trans
Arabic NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Basque NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Catalan; Valencian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Croatian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Danish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Dutch; Flemish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
English 1.0 2.0 1.0 1.0 1.0 2.0 9.0 1.0 3.0 1.0 ... 2.0 1.0 2.0 2.0 1.0 1.0 1.0 1.0 2.0 3.0
English UK NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Finnish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
French NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
German NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Greek, Modern (1453-) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Hebrew NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Hungarian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Indonesian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Italian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Japanese NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Korean NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
LOLCATZ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Norwegian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Polish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Portuguese NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Russian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Spanish; Castilian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Swedish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Turkish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Vietnamese NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

27 rows × 140 columns

pivot_table is a shortcut for a groupby followed by a reshape. If you have used PivotTables in Excel, its usage in pandas will feel familiar.

In [37]:
pd.pivot_table(index="lang_trans", columns="trans_sour", data=db,values="id_str", aggfunc="count",margins=True )
Out[37]:
trans_sour rohingya Update #BREAKING 5b2b04cc8c451264154992 5b37d2a786aeb453738826 5b40d42167e88445731584 Academic Chatter AntifaTron 2.0 Aplos for Twitter BLOX CMS Bitly ... phos-tweet1 preciousproverbs pxbern reweets test djt ret todaysmathematics twicca weirdrobots whotrendedit Оwly All
lang_trans
Arabic NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 4
Basque NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
Catalan; Valencian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
Croatian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
Danish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 6
Dutch; Flemish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 22
English 1.0 2.0 1.0 1.0 1.0 2.0 9.0 1.0 3.0 1.0 ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 2.0 3.0 58585
English UK NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 118
Finnish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 3
French NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 74
German NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 74
Greek, Modern (1453-) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 8
Hebrew NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 5
Hungarian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
Indonesian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
Italian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 87
Japanese NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 52
Korean NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 4
LOLCATZ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
Norwegian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
Polish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 11
Portuguese NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 28
Russian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
Spanish; Castilian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 113
Swedish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 10
Turkish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
Vietnamese NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
All 1.0 2.0 1.0 1.0 1.0 2.0 9.0 1.0 3.0 1.0 ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 2.0 3.0 59216

28 rows × 141 columns
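
To check that `pivot_table` really is a groupby in disguise, the following sketch builds the same counts both ways on a made-up table (column names are placeholders for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "lang": ["English", "English", "French"],
        "source": ["iPhone", "Android", "iPhone"],
        "id": [1, 2, 3],
    }
)

# One call: rows, columns, the value to aggregate, and how to aggregate it.
pivoted = pd.pivot_table(
    toy, index="lang", columns="source", values="id", aggfunc="count"
)

# The long way: group by both keys, count, then move `source` to the columns.
grouped = toy.groupby(["lang", "source"])["id"].count().unstack()

print(pivoted)
print(grouped)
```

Both tables hold NaN where a language and source never occur together, just like the large output above.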

Wait, there is an even simpler way if you only want to count occurrences: crosstab.

In [40]:
pd.crosstab(db["lang_trans"],db["trans_sour"])
Out[40]:
trans_sour rohingya Update #BREAKING 5b2b04cc8c451264154992 5b37d2a786aeb453738826 5b40d42167e88445731584 Academic Chatter AntifaTron 2.0 Aplos for Twitter BLOX CMS Bitly ... otm_wptt phos-tweet1 preciousproverbs pxbern reweets test djt ret todaysmathematics twicca weirdrobots whotrendedit Оwly
lang_trans
Arabic 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Basque 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Catalan; Valencian 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Croatian 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Danish 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Dutch; Flemish 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
English 1 2 1 1 1 2 9 1 3 1 ... 2 1 2 2 1 1 1 1 2 3
English UK 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Finnish 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
French 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
German 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Greek, Modern (1453-) 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Hebrew 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Hungarian 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Indonesian 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Italian 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Japanese 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Korean 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
LOLCATZ 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Norwegian 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Polish 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Portuguese 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Russian 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Spanish; Castilian 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Swedish 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Turkish 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Vietnamese 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

27 rows × 140 columns
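To make it clearer what crosstab is counting, here is a minimal sketch on a toy table. The column names mirror the ones above, but the five rows are made up for illustration and the variable names (toy, counts, by_hand, with_totals) are just placeholders. It shows that crosstab gives the same counts as a manual group-and-count, and that margins=True adds the "All" totals.

```python
import pandas as pd

# Toy data: column names mirror the notebook, the values are invented.
toy = pd.DataFrame({
    "lang_trans": ["English", "English", "French", "English", "French"],
    "trans_sour": ["Bitly", "twicca", "Bitly", "Bitly", "Bitly"],
})

# crosstab counts how often each (row, column) pair occurs;
# combinations that never occur are filled with 0 instead of NaN.
counts = pd.crosstab(toy["lang_trans"], toy["trans_sour"])

# The same table built by hand: group, count, then spread the
# inner level out into columns.
by_hand = (
    toy.groupby(["lang_trans", "trans_sour"])
       .size()
       .unstack(fill_value=0)
)

# margins=True appends an "All" row and column holding the totals.
with_totals = pd.crosstab(toy["lang_trans"], toy["trans_sour"], margins=True)

print(counts)
print(with_totals)
```

So if all you need is counts, crosstab saves you from choosing a value column and an aggregation function.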

(If we still have time, we can do) Practice-Oriented Data Visualization 🌟

This part is also practice-oriented, so I don't want to delve into the nitty-gritty of data viz. I'll only focus on how to make a visualization fast and simple.

There are many ways to make a plot in Python.

  • The most fundamental one, and also the most customizable: Matplotlib

Matplotlib

In [3]:
from __future__ import print_function, division
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
In [4]:
plt.scatter(x="tweets_num",y="created_days",data=user.fillna(0))
Out[4]:
<matplotlib.collections.PathCollection at 0x213d8f4f978>

But because it exposes so many customization options, the learning curve is steep and you have to add the elements one by one.
And the default style is so UGLY
That's why I recommend using Seaborn

Seaborn is a high-level package built on Matplotlib. But it is waaaaaaaay more user-friendly and BEAUTIFUL
It's like Matplotlib with makeup
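For comparison, here is a rough sketch of what "adding the elements one by one" looks like in plain Matplotlib. The numbers are made up and only stand in for tweets_num and created_days; the title text is also just an example. The "Agg" backend line is there so the sketch runs outside a notebook; inside a notebook with %matplotlib inline you can drop it.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Made-up values standing in for the tweets_num / created_days columns.
tweets_num = [10, 200, 35, 1200, 640]
created_days = [30, 900, 120, 2500, 1800]

# Every visual element is a separate call on the Axes object.
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(tweets_num, created_days, alpha=0.5)  # the points
ax.set_title("Tweets vs. account age")           # the title
ax.set_xlabel("tweets_num")                      # the x-axis label
ax.set_ylabel("created_days")                    # the y-axis label
ax.grid(True)                                    # the background grid
fig.tight_layout()
```

Nothing here is hard, but each piece is a separate decision, which is where the steep learning curve comes from.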

Pandas Built-in

  • If you just want a simple plot, you can use the plotting built into pandas. It produces the same style as Matplotlib because it uses Matplotlib under the hood
In [5]:
user.plot(kind="scatter",x="tweets_num",y="created_days")
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x213d8bee400>
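As the Out line hints, the pandas plot call hands back a Matplotlib Axes object, so you can keep tweaking it with ordinary Matplotlib methods. Below is a small sketch of that idea; toy_user and its values are invented stand-ins for the user table, and the "Agg" backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed inside a notebook
import pandas as pd
from matplotlib.axes import Axes

# Invented stand-in for the `user` DataFrame used in this notebook.
toy_user = pd.DataFrame({
    "tweets_num": [10, 200, 35, 1200, 640],
    "created_days": [30, 900, 120, 2500, 1800],
})

# pandas draws the plot with Matplotlib and returns the Axes it drew on.
ax = toy_user.plot(kind="scatter", x="tweets_num", y="created_days")

# From here on it is plain Matplotlib.
ax.set_title("Tweets vs. account age")
ax.grid(True)

print(isinstance(ax, Axes))
```

This is handy when the quick pandas plot is almost what you want and only needs a title or a grid.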

Seaborn

In [54]:
import seaborn as sns
sns.set()
In [77]:
fig, ax=plt.subplots(figsize=(15, 10))
sns.scatterplot(data=user,x="tweets_num",y="created_days",hue="lang_trans", palette="Set2", size="tweets_num",
                alpha=0.3, x_jitter=True, ax=ax).set_title("Test");

Interactive Plot with Bokeh & Plotly

There are also interactive plotting packages; the most popular ones are Bokeh and Plotly

In [85]:
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource
from bokeh.models.tools import HoverTool


output_notebook()

p = figure()
p.circle(x="tweets_num",y="created_days",
         source=user,
         size=10, color='green')

p.title.text = 'test'
p.xaxis.axis_label = 'tweets_num'
p.yaxis.axis_label = 'created_days'

# Tooltips look up data columns with the @column_name syntax
hover = HoverTool()
hover.tooltips = [
    ('tweets_num', '@tweets_num'),
    ('created_days', '@created_days'),
]

p.add_tools(hover)

show(p)
Loading BokehJS ...

So, you can see that tools like Bokeh and Plotly are quite cumbersome to code...
But I'm going to introduce Plotly Express, a new package launched in 2019 (yeah, you are learning the cutting-edge packages of the Python world) that is built on Plotly and has a syntax similar to Seaborn's

Plotly Express

In [6]:
import plotly_express as px
In [7]:
px.scatter(user.fillna(0), x="tweets_num",y="created_days", marginal_x="box", title="test", color="lang_trans")

Any Questions?

Thanks again for your patience!